Developing an Update Statistics Utility
by Rick Baker

Overview     

This article contains the following information:

Introduction
The Problem: Too Many Choices
The Solution: Let a Program Do It
The Programs
Conclusion
Acknowledgment
About the Author

Introduction

If you have a database that is larger than a few megabytes in size, you quickly learn that updating the statistics for the database is absolutely essential to performance. Without the information generated by the UPDATE STATISTICS command, the optimizer has practically no information with which to develop a query plan. The result is a horrendous number of sequential scans and terrible performance. Updating the database statistics is the single most important thing you can do to get the database to function efficiently. Other tuning efforts can improve (or worsen) performance by a few percentage points; updating statistics will improve performance by orders of magnitude.

But you probably already know this. The purpose of this article is not to describe what the UPDATE STATISTICS command does, or to quantify the impact. It is rather to address the problem of how to go about doing it in an efficient manner—efficient for both the database and the DBA.

The Problem: Too Many Choices

You can find the procedure recommended by Informix for updating statistics in the Performance Guide for Informix Dynamic Server, Version 7.3, February 1998. The discussion starts on page 10-6. The recommendation for doing a thorough statistical update is, in brief (the same guidelines are repeated in the comments at the top of main.4gl below):

1. Update statistics HIGH for small tables, one command per table.
2. Update statistics MEDIUM, DISTRIBUTIONS ONLY, for all columns that do not head or differentiate an index; use resolution 1.0 with confidence .99 for very large tables, and use one command per table.
3. Update statistics HIGH for every column that heads or differentiates an index, one command per column.
4. Update statistics LOW for all columns in each multicolumn index.

This, of course, is the general recommendation. You may find, if you have time to do extensive testing, that a different series of statistics will give better performance for your installation.
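
To make the recommendation concrete, here is roughly the set of commands it implies for a hypothetical orders table of about a million rows, with a single-column index on order_num and a composite index on (customer_num, order_date). The table and column names are illustrative only:

-- One MEDIUM command per table for the columns that do not head or
-- differentiate an index (resolution tightened because the table is large)
update statistics medium for table orders
	(order_date, ship_date, ship_weight, ship_charge, paid_date)
	resolution 1.00 0.99 distributions only;

-- One HIGH command for each column that heads (or differentiates) an index
update statistics high for table orders (order_num);
update statistics high for table orders (customer_num);

-- One LOW command covering all columns of the multicolumn index
update statistics low for table orders (customer_num, order_date);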

In any case, the problem for the DBA is generating all the statements needed to update the statistics for an entire database. In our situation, we have a database with 982 tables and 1,465 indexes. It requires 1,903 UPDATE STATISTICS commands to comply with the recommendations above. If you have a very stable system, you could build all these commands once and then run them as often as necessary. However, if you frequently change the structure of the database by adding or dropping tables and indexes, you would have to modify your massive SQL every time you made a change. Just what you needed—more work!

There is also the problem of run time. It would be nice if updating the statistics did not take too long, but on a system of any size it does, especially for the HIGH commands on very large tables. One way to speed up the process is to break the string of SQL statements into multiple files and run the files simultaneously. When I first started using this approach, it cut the run time on our system from about 7 hours to about 3 hours. But this adds complexity as well: each of the command files should be balanced so that, ideally, they all complete at about the same time. They should also be balanced across the dbspaces to minimize I/O contention. How do you determine which commands to put in which files in order to achieve this balance?

The Solution: Let a Program Do It

The solution presented here is a shell script, an awk program, and an Informix® 4GL program which take care of all of the above complexities for you. They do not produce a perfect solution (and I will note the imperfections below), but the solution is much quicker to obtain and much more flexible than doing the job manually.

In order to create and run multiple SQL command files simultaneously, I took the following approach:

1. A 4GL program reads the system catalog, builds every UPDATE STATISTICS command needed for the database, and writes all of the commands to a single output file, prefixing each command with the number of the command file (stream) to which it belongs.
2. An awk program reads that output file and splits the commands into the individual SQL command files.
3. A shell script runs the two programs above, executes dbaccess against all of the command files simultaneously, waits for them to finish, and then runs the file containing the UPDATE STATISTICS FOR PROCEDURE command.

(An easy modification to the 4GL program would be to write the commands to a temporary table instead of a single output file, and then sort the table in stream order and write the individual command files directly from the 4GL program. This would eliminate the awk program.)

The "brain" of the system is the 4GL program. This program reads the system catalog tables to get the necessary information for every table and index in a database.

Three options are available to control the number of streams generated. The first, and simplest, is to add no qualifiers to the command that executes the program; this places all UPDATE STATISTICS commands in a single file. The second is "automatic" stream generation: one stream, or command file, is created for each dbspace that contains a table, and all the commands for all the tables in a dbspace go into that dbspace's command file. The third option is to specify a fixed number of streams; in this case, the program optimizes the assignment of commands to streams based on the balancing criteria discussed above.

If there is any "magic" in this program, it is in the optimization function called when using a fixed number of streams. A cost for each UPDATE STATISTICS command is calculated based on the number of rows in the table using a formula derived from testing on my system. The parameters of the formula may not be accurate on your system, but they provide a starting point.

The program collects all the information needed to create the SQL commands for each table, including the cost, and stores it in temp tables. When all the database tables have been processed, the program reads the temp tables and writes the actual commands to an output file.

The Programs

Three programs are defined to implement this system: a 4GL program to create the actual UPDATE STATISTICS commands, an awk program to split the list of SQL commands into multiple files, and a shell script which runs the first two programs and then executes the SQL command files.

updstat.sh
This is the shell script that controls the whole process:


#!/usr/bin/ksh
#
# updstat.sh - Update statistics for a database
#
#		Runs updstat.4ge to generate a list of sql statements that
#		will properly update the statistics for the specified
#		database. Splits these statements into one or more command
#		files, then runs all the files simultaneously.
#
#		ARGUMENTS:
#			-d dbname	database name (required)
#			<n>			number of simultaneous streams to generate
#			-a			auto-generate streams
#
# Determine if this is a single stream or multistream run
#
if [ $# -lt 3 ]
then
		single_stream=Y
else
		single_stream=N
fi
# Get the database name from the arguments
#
arglist=""
while [ $# -gt 0 ]
do
	case $1 in
		-d)	dbname=$2
			arglist="$arglist -d $dbname"
			shift
			shift
			;;
		*)	arglist="$arglist $1"
			shift
			;;
	esac
done
#
# Set environment
#
INFORMIXDIR=/usr/informix
ONCONFIG=onconfig.bp2
INFORMIXSERVER=bright2
INFORMIXSQLHOSTS=/etc/sqlhosts
PATH=$INFORMIXDIR/bin:$PATH
export INFORMIXDIR ONCONFIG INFORMIXSERVER INFORMIXSQLHOSTS PATH
# Move to work directory and clean up from previous runs
#
progdir=`pwd`
cd /work/informix/updstat
if [ -d $dbname ]
then
	cd $dbname
	rm -f *
else
	mkdir $dbname
	cd $dbname
fi
# Build the list of sql commands
# (If generating a single stream, the output file will be
# "updstat.<dbname>.sql". If generating multiple streams,
# the output file will be "updstat.<dbname>.tmp".)
#
$progdir/updstat.4ge $arglist
[ $? -gt 0 ] && exit
if [ "$single_stream" = "Y" ]
then
 #
 # Run single SQL file
 #
 dbaccess $dbname updstat.$dbname.sql > updstat.$dbname.out 2>&1
else
#
 # Split commands into multiple command files
 # (The command files will be named
 # "updstat.<dbname>.<nnn>.sql" where <nnn> is a sequence
 # number, plus one additional file named
 # "updstat.<dbname>.PROC.sql".)
 #
 nawk -f $progdir/updstat.awk -v dbname=$dbname updstat.$dbname.tmp
 #
 # Run the sql command files simultaneously
 #
 for file in `ls updstat.$dbname.???.sql`
 do
		outfile=`echo $file | sed 's/sql/out/'`
		dbaccess $dbname $file > $outfile 2>&1 &
 done
 wait
 #
 # Run the command file for stored procedures last
 #
 dbaccess $dbname updstat.$dbname.PROC.sql > updstat.$dbname.PROC.out 2>&1
fi


Be careful of the number of simultaneous streams that you generate. There may be a limit defined in the UNIX kernel on the number of processes that can be generated by one user. Check with your UNIX system administrator if you get an error indicating too many processes.

The command line parameters of this script are passed to the 4GL program. Within the designated work directory, a subdirectory will be created (if it does not already exist) where all the work files associated with the database being processed are to be placed. This subdirectory will have the same name as the database.

updstat.4ge
The following is the 4GL program that generates the SQL UPDATE STATISTICS commands. In addition to the arguments accepted by the shell script, the 4GL program will also accept an argument of -t <table name>. This facilitates generating the commands for a single table when needed; for instance, after unloading and reloading a table.

The code is divided into three modules: globals.4gl, main.4gl, and build.4gl. globals.4gl contains definitions of the variables used globally throughout the program. main.4gl controls the flow of the processing. build.4gl contains the functions that gather the required information and then create the SQL statements. There are a few comments at the beginning of each module.

globals.4gl
There are some arrays defined here that are sized to handle a fairly large database, but they will not handle every situation. You will need to check the following array sizes to make sure they accommodate your system (the sample catalog queries after the listing can help you check):


g_colname	maximum number of columns in a table
g_hi_list	maximum number of columns in a table needing HIGH update
g_med_list	maximum number of columns in a table needing MEDIUM update
g_low_list	maximum number of columns in a table needing LOW update
g_stream	maximum number of streams to be generated

globals
	define
		scratch		char(2048),	# Work area
		g_single		smallint,	# Single table?
		g_auto		smallint,	# Auto-generate streams?
		g_tabid		integer,
		g_tabname		char(18),
		g_nrows		integer,
		g_dbspace		char(3),		# Dbspace (taken from
								# partnum)
		g_colname array[500] of char(18),
		g_numcols		smallint,

		g_idxrec array[100] of record
			col1		smallint,
			col2		smallint,
			col3		smallint,
			col4		smallint,
			col5		smallint,
			col6		smallint,
			col7		smallint,
			col8		smallint,
			col9		smallint,
			col10		smallint,
			col11		smallint,
			col12		smallint,
			col13		smallint,
			col14		smallint,
			col15		smallint,
			col16		smallint
		end record,
		g_numidx		smallint,
		g_hi_list	array[500] of smallint,
		g_med_list	array[500] of smallint,
		g_low_list	array[500] of smallint,
		g_numhi		smallint,
		g_nummed		smallint,
		g_numlow		smallint,
		g_stream array[200] of record
			name		char(3),
			cost		float
		end record,
		g_numstrm		smallint,	# Number of output streams
		g_currstrm		smallint		# Current output file number
end globals
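
If you would rather measure than guess, a few quick catalog queries like the following (a sketch; they assume user tables start at tabid 100 and that you can query the sysmaster database) report the largest values these arrays must accommodate:

-- Largest number of columns in any user table: compare to the size of
-- g_colname, g_hi_list, g_med_list, and g_low_list (currently 500)
select max(ncols) from systables where tabid > 99 and tabtype = 'T';

-- Largest number of indexes on any user table: compare to the size of
-- g_idxrec (currently 100); look at the first row returned
select tabid, count(*) num_indexes from sysindexes
	where tabid > 99 group by 1 order by 2 desc;

-- Number of dbspaces, an upper bound on the streams the auto-stream
-- option can create: compare to the size of g_stream (currently 200)
select count(*) from sysmaster:sysdbspaces;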

main.4gl
Small tables are defined in function process() as those with 1,000 rows or fewer. This is an arbitrary choice on my part; I have not done any testing to determine the optimum point at which a table should be considered "small."

The last line written to the output file is an UPDATE STATISTICS for procedure command. Statistics for procedures are updated after statistics for all tables are updated so that the procedures are optimized using the latest statistical data for any tables that are referenced.


# updstat - Generate update statistics SQL files to update
# the statistics for a table or an entire database.
#
# Written by Rick Baker
# iirab@brightpoint.com
# Brightpoint, Inc.
# Indianapolis, IN
#
# 	Must be user informix or root to run.
# 	Written to follow the guidelines found in "Performance
# 	Guide for Informix Dynamic Server", Version 7.3, February
# 	1998:
#
# 		update small tables HIGH.
# 		update MEDIUM DISTRIBUTIONS ONLY all columns that
# 		do not head or differentiate an index. Use resolution 1.0,
# 		  .99 for very large tables. Use one command per table.
# 		update HIGH all columns that head or differentiate an
# 		  index. Use one command for each column.
# 		update LOW for all columns in a multicolumn index.

# ARGUMENTS:
# 	-d <dbname>	database name (required)
# 	-t <tabname>	table name (optional)
# 	<n>			number of SQL files to write (optional)
# 	-a			(auto) create one SQL file for each
# 				  dbspace containing tables or indexes
#
# DISCUSSION
#
# Three different methods of processing the tables are provided.
#
# 1. Sequential processing. All tables are processed sequentially
#	in a single SQL command file.
#
# 2. Multi-stream processing. <n> SQL command files are created.
#	Each table will require one or more UPDATE STATISTICS
#	commands. All the commands for a table are sent to one command
#	file to avoid multiple commands being run against the same
#	table simultaneously. An attempt is made to balance the workload
#	for each command file by balancing the "cost" for each table.
#	"Cost" is based on the number of rows in the table and the
#	number of columns which must have UPDATE STATISTICS HIGH run
#	against them.
#
# 3. Auto-stream generation. An SQL command file is created for
#	each dbspace which contains tables. The SQL commands for
#	each table are written to the command file for the dbspace
#	which contains the table.
#
# All output from this program is actually written to a single
# file. Before each UPDATE STATISTICS command is written to this
# file, it is prepended with the number of the SQL command file.
# The output file must be read by an awk program and split into
# the actual SQL files.
#
# MODIFICATION HISTORY
#
# 1/14/99	rab	Change auto-stream to put all commands for one
#			table in the same command file.
#			Put the "procedures" command in stream 0 and modify
#			the awk program and shell script to run it after
#			all other streams have completed.
# 1/25/99	rab	Change logic for multi-stream: for each table,
#			evaluate the cost and store cost and commands in
#			temp tables. Select from the temp tables in order
#			of cost and assign to streams. (Almost 50% reduction
#			in overall run time.)
#
# 3/24/99	rab	Add -t option to process a single table.
#
# 8/19/99	rab	Use 7.3 recommendations.
#			Add nrows to temp table 2 so it is available at
#			write time.
##################################################################
##
database sysmaster
globals "globals.4gl"
####################################################################
main
####################################################################
	call init()
	call process()
	call finish()
end main
####################################################################
function init()
####################################################################
#
# This function:
#	Gets command line arguments:
#		database name (required)
#		optional arguments
#	Sets the database
#	Sets lock mode
#	Sets read mode
#	Calls function to create temp tables
#	Calls function to initialize the cursors
#	Calls function to initialize the cursors in build.4gl
#	Starts the report
#
	define
		i			smallint,
		dbname		char(18)
	# Set database and selection criteria
	let dbname = null
	let g_single = false
	let g_auto = false
	let g_numstrm = 1
	for i = 1 to num_args()
		case
			when arg_val(i) = "-d"
				let i = i + 1
				let dbname = arg_val(i)
			when arg_val(i) = "-t"
				let i = i + 1
				let g_tabname = arg_val(i)
				let g_single = true
				when arg_val(i) = "-a"
					let g_auto = true
					let g_numstrm = 0
				otherwise
					if arg_val(i) >= 1 and arg_val(i) <= 999
					then
						let g_numstrm = arg_val(i)
				else
						let scratch =
							"Invalid argument: '",
							arg_val(i) clipped,
							"'. Aborting..."
						display scratch
					exit program (1)
				end if
			end case
		end for
		if dbname is null
		then
			display "No database name specified. Aborting..."
			exit program (1)
		end if
		close database
		database dbname
		# Set parameters for reading the system tables
		set lock mode to wait 30
		whenever error continue
		set isolation to dirty read
		whenever error stop

		# Create temp tables
		call create_temp()
		# Prepare cursors
		call init_cursors()
		call init_tmp_cursors()
		# Check for multiple streams
		if g_numstrm > 1
		then
			call init_streamlist()
		end if
		# Start the report
		if g_single
		then
			let scratch =
				"updstat.",
				g_tabname clipped,
				".sql"
		else
			let scratch =
				"updstat.",
				dbname clipped,
				".tmp"
		end if
		start report report1 to scratch
end function
# init()

##################################################################
function create_temp()
##################################################################
#
# This function creates the temp tables.
#
	create temp table updstat1
	(
		level		char(1),
		tabname	char(18),
		colname	char(18)
	) with no log
	create temp table updstat2
	(
		tabname	char(18),
		dbspace	char(3),
		cost		float,
		nrows		integer
	) with no log
end function
# create_temp()
##################################################################
function init_cursors()
##################################################################
#
# This function defines all the cursors used in this module.
#
	 let scratch = "select a.tabid, a.tabname, a.nrows,",
" hex(a.partnum), hex(b.partn)",
" from systables a, outer sysfragments b",
" where a.tabtype = 'T'",
" and a.tabid = b.tabid",
" and b.fragtype = 'T'"
	if g_single
	then
		let scratch = scratch clipped,
" and a.tabname = '",
g_tabname clipped,
"'"

else
		let scratch = scratch clipped,
" order by tabid"
	end if
	prepare systables_spec from scratch
	declare systables_curs cursor for systables_spec
	let scratch = "select colno, colname",
" from syscolumns",
" where tabid = ?",
" order by colno"
	prepare syscolname_spec from scratch
	declare syscolname_curs cursor for syscolname_spec
	let scratch = "select part1, part2, part3, part4,",
" part5, part6, part7, part8,",
" part9, part10, part11, part12,",
" part13, part14, part15, part16",
" from sysindexes",
" where tabid = ?",
" order by part1, part2, part3, part4,",
" part5, part6, part7, part8"
	prepare sysindexes_spec from scratch
	declare sysindexes_curs cursor for sysindexes_spec
end function
# init_cursors()

##################################################################
function init_streamlist()
##################################################################
#
# This function initializes the stream list. If a specified number
# of streams are to be used, the cost field is set to zero for
# the requested number of elements.
#
	define
		i	smallint
	for i = 1 to g_numstrm
		let g_stream[i].cost = 0
	end for
end function
# init_streamlist()
##################################################################
function process()
##################################################################
#
# This function loops through each table in the database. It calls
# functions to get the column name and index data, identify which
# columns need to be specified in which SQL statements, determine
# the SQL file, and save the data in the temp tables. After all
# tables have been processed, a function is called to write the
# SQL command file.
#
	define
		partnum_p		char(10),	# Partnum from systables
		partnum_f		char(10),	# Partnum from sysfragments
		prev_tabid		integer
	let prev_tabid = 0

	# Loop through each table
	foreach systables_curs
	into g_tabid, g_tabname, g_nrows, partnum_p, partnum_f
		# Don't repeat a table
		if g_tabid = prev_tabid
		then
			continue foreach
		end if
		# Set the partnum
		if partnum_p = 0
		then
			let g_dbspace = partnum_f[3,5]
		else
			let g_dbspace = partnum_p[3,5]
		end if
		# Process small tables separately
		if g_nrows <= 1000
		then
			call small_table()
			continue foreach
		end if

		# Load the column names
		call get_colnames()
		# Load the index definitions
		call get_indexes()
		# Build the column lists
		call build_lists()
		# Save the data in the temp tables
		call save_data()
		# Save the tabid
		let prev_tabid = g_tabid
	end foreach
	# Write the SQL statements
	call write_stmts()
end function
# process()

##################################################################
function get_colnames()
##################################################################
#
# This function loads the array g_colname with the names of each
# column in the current table.
#
	define
		colno 	smallint
	let g_numcols = 1
	foreach syscolname_curs
		using g_tabid
		into colno, g_colname[g_numcols]
		let g_numcols = g_numcols + 1
	end foreach
	let g_numcols = g_numcols - 1
end function
# get_colnames()

##################################################################
function get_indexes()
###################################################################
# This function loads the array g_idxrec with the definition of
# each index defined for the current table.
#
	let g_numidx = 1
	foreach sysindexes_curs
		using g_tabid
		into g_idxrec[g_numidx].col1,
g_idxrec[g_numidx].col2,
g_idxrec[g_numidx].col3,
g_idxrec[g_numidx].col4,
g_idxrec[g_numidx].col5,
g_idxrec[g_numidx].col6,
g_idxrec[g_numidx].col7,
g_idxrec[g_numidx].col8,
g_idxrec[g_numidx].col9,
g_idxrec[g_numidx].col10,
g_idxrec[g_numidx].col11,
g_idxrec[g_numidx].col12,
g_idxrec[g_numidx].col13,
g_idxrec[g_numidx].col14,
g_idxrec[g_numidx].col15,
g_idxrec[g_numidx].col16
		# Increment the index to this array
		let g_numidx = g_numidx + 1
	end foreach
	let g_numidx = g_numidx - 1
end function
# get_indexes()

##################################################################
function finish()
###################################################################
# This function updates statistics for all procedures and then
# closes the report file.
#
		if not g_single
		then
			# Update statistics for all procedures
			let g_currstrm = 0
			let scratch = "update statistics for procedure;"
			output to report report1()
		end if
		finish report report1
end function
# finish()

build.4gl
As mentioned previously, allocating commands to one or more command files can follow one of three methods. The simple method is a single stream. In the auto-stream method, one command stream is created for each dbspace that contains a table. All the commands for all the tables that reside in one dbspace are put in the same stream. The logic behind this is that the UPDATE STATISTICS command reads from the dbspace where the table is located. Putting commands for tables in the same dbspace in the same stream eliminates I/O contention in each dbspace. In addition, if the space used in the dbspaces is reasonably well balanced, the time to execute each of the streams should be similarly balanced. (However, this reasoning does not consider row size. The number of rows in a table is much more important than the total size of the table in determining the time it takes to execute a command.)
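
If you are unsure how evenly your tables are spread across dbspaces, a query along the following lines gives a rough picture (a sketch only: it reads systables, so fragmented tables, whose partnum is 0 there, are not counted, and it assumes your server supports the DBINFO('dbspace') function):

-- Tables and row counts per dbspace, largest first
select dbinfo('dbspace', partnum) dbspace_name,
	count(*) num_tables,
	sum(nrows) total_rows
from systables
where tabid > 99 and tabtype = 'T' and partnum > 0
group by 1
order by 3 desc;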

The third method uses a fixed number of streams. The number of streams to be used is input by the user on the command line. In this case, the "cost" of executing each command is estimated and the commands are placed into the streams so that the total cost is balanced across the streams. All of the commands for one table are placed together in the same stream in order to prevent two different streams from trying to read the same table at the same time. This method does not take into account the dbspace in which the tables reside. If two tables from the same dbspace were being processed at the same time, there could be some I/O contention.

The "cost" of executing each UPDATE STATISTICS command is computed in formulas that were determined from testing on our system. I timed the commands as they ran against tables with varying numbers of rows, then determined a formula which approximates the data observed. The results will vary on other systems. If you want to have very accurate cost formulas, you will need to do testing on your system and modify the formulas appropriately.


globals "globals.4gl"
##################################################################
function init_tmp_cursors()
###################################################################
# This function creates cursors to be used with the temp tables.
#
	# Prepare insert statements
	let scratch = "insert into updstat1 values(?, ?, ?)"
	prepare ins_updstat1 from scratch
	let scratch = "insert into updstat2 values(?, ?, ?, ?)"
	prepare ins_updstat2 from scratch
	# Prepare select cursors
	let scratch =
"select colname",
" from updstat1",
" where tabname = ?",
" and level = ?"
	prepare tbl_col_stmt from scratch
	declare tbl_col_curs cursor for tbl_col_stmt

	let scratch =
"select tabname, dbspace, cost, nrows",
" from updstat2",
" order by cost desc"
	prepare cost_stmt from scratch
	declare cost_curs cursor for cost_stmt
end function
# init_tmp_cursors()
##################################################################
function build_lists()
###################################################################
# This function creates the lists of columns in the current table
# that need high, medium, and low commands.
	# Build the lists
	call build_hi_list()
	call build_med_list()
	call build_low_list()
end function
# build_lists()

##################################################################
function build_hi_list()
###################################################################
# This function creates the list of columns in the current table
# that need HIGH statistics: those that head an index and those
# that differentiate an index.
#
	define
		i			smallint
	# Reset the counter
	let g_numhi = 0
	# Build list of columns that head an index
	for i = 1 to g_numidx
		call add_to_hi_list(g_idxrec[i].col1)
	end for
	# Add columns that differentiate indexes that have the same
	# starting column(s).
	# (Use the table's dbspace).
	for i = 2 to g_numidx
		case
			when g_idxrec[i].col1 != g_idxrec[i-1].col1
exit case
			when g_idxrec[i].col2 != g_idxrec[i-1].col2
call add_to_hi_list(g_idxrec[i].col2)
			when g_idxrec[i].col3 != g_idxrec[i-1].col3
call add_to_hi_list(g_idxrec[i].col3)
			when g_idxrec[i].col4 != g_idxrec[i-1].col4
call add_to_hi_list(g_idxrec[i].col4)
			when g_idxrec[i].col5 != g_idxrec[i-1].col5
call add_to_hi_list(g_idxrec[i].col5)
			when g_idxrec[i].col6 != g_idxrec[i-1].col6
call add_to_hi_list(g_idxrec[i].col6)
			when g_idxrec[i].col7 != g_idxrec[i-1].col7
call add_to_hi_list(g_idxrec[i].col7)
			when g_idxrec[i].col8 != g_idxrec[i-1].col8
call add_to_hi_list(g_idxrec[i].col8)
			when g_idxrec[i].col9 != g_idxrec[i-1].col9
call add_to_hi_list(g_idxrec[i].col9)

			when g_idxrec[i].col10 != g_idxrec[i-1].col10
call add_to_hi_list(g_idxrec[i].col10)
			when g_idxrec[i].col11 != g_idxrec[i-1].col11
call add_to_hi_list(g_idxrec[i].col11)
			when g_idxrec[i].col12 != g_idxrec[i-1].col12
call add_to_hi_list(g_idxrec[i].col12)
			when g_idxrec[i].col13 != g_idxrec[i-1].col13
call add_to_hi_list(g_idxrec[i].col13)
			when g_idxrec[i].col14 != g_idxrec[i-1].col14
call add_to_hi_list(g_idxrec[i].col14)
			when g_idxrec[i].col15 != g_idxrec[i-1].col15
call add_to_hi_list(g_idxrec[i].col15)
			when g_idxrec[i].col16 != g_idxrec[i-1].col16
call add_to_hi_list(g_idxrec[i].col16)
		end case
	end for
end function
# build_hi_list()

####################################################################
function build_med_list()
##################################################################
#
# This function creates the list of columns in the current table
# that need MEDIUM statistics: all columns in indexes that are
# not already identified as needing HIGH statistics.
#
	define
		i 		smallint
	# Reset the counter
	let g_nummed = 0
	# Build the list
	for i = 1 to g_numidx
		call add_to_med_list(g_idxrec[i].col1)
		if g_idxrec[i].col2 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col2)
		if g_idxrec[i].col3 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col3)
		if g_idxrec[i].col4 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col4)
		if g_idxrec[i].col5 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col5)
		if g_idxrec[i].col6 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col6)
		if g_idxrec[i].col7 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col7)
		if g_idxrec[i].col8 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col8)
		if g_idxrec[i].col9 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col9)
		if g_idxrec[i].col10 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col10)
		if g_idxrec[i].col11 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col11)
		if g_idxrec[i].col12 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col12)
		if g_idxrec[i].col13 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col13)
		if g_idxrec[i].col14 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col14)
		if g_idxrec[i].col15 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col15)
		if g_idxrec[i].col16 = 0 then continue for end if
		call add_to_med_list(g_idxrec[i].col16)
	end for
end function
# build_med_list()
##################################################################
function build_low_list()
##################################################################
#
# This function creates the list of columns in the current table
# that need LOW statistics: all columns in multicolumn indexes.
#
	define
		i			smallint
	# Reset the counter
	let g_numlow = 0
	# Build the list
	for i = 1 to g_numidx
		if g_idxrec[i].col2 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col1)
		call add_to_low_list(g_idxrec[i].col2)
		if g_idxrec[i].col3 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col3)
		if g_idxrec[i].col4 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col4)
		if g_idxrec[i].col5 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col5)
		if g_idxrec[i].col6 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col6)
		if g_idxrec[i].col7 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col7)
		if g_idxrec[i].col8 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col8)
		if g_idxrec[i].col9 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col9)
		if g_idxrec[i].col10 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col10)
		if g_idxrec[i].col11 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col11)
		if g_idxrec[i].col12 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col12)
		if g_idxrec[i].col13 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col13)
		if g_idxrec[i].col14 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col14)
		if g_idxrec[i].col15 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col15)
		if g_idxrec[i].col16 = 0 then continue for end if
		call add_to_low_list(g_idxrec[i].col16)
	end for
end function
# build_low_list()
##################################################################
function small_table()
##################################################################
#
# This function creates the SQL statement for a small table.
#
	define
		cost			float
	# Insert record for HIGH command
	execute ins_updstat1 using "H", g_tabname, ""
	# Insert cost record
	let cost = .001 * g_nrows + .2952
	execute ins_updstat2 using g_tabname, g_dbspace, cost, g_nrows
end function
# small_table()

##################################################################
function add_to_hi_list(colnum)
##################################################################
#
# This function adds a column to the list of columns that are to
# be updated HIGH if the column is not already in the list.
#
	define
		colnum		smallint,
		i			smallint
# Reverse sign if negative (negative column numbers in sysindexes
# indicate that the column is sorted descending instead of ascending)
	if colnum < 0
	then
		let colnum = - colnum
	end if
# Check the current list
	for i = 1 to g_numhi
		if colnum = g_hi_list[i]
		then
			#Column is already in the list: exit the function
			return
		end if
	end for
# Column not found: add to list
	let g_numhi = g_numhi + 1
	let g_hi_list[g_numhi] = colnum
end function
# add_to_hi_list()

##################################################################
function add_to_med_list(colnum)
##################################################################
#
# This function adds a column to the list of columns that are to
# be updated MEDIUM if the column is not already in the HIGH list
# and not already in the MEDIUM list.
#
	define
		colnum		smallint,
		i			smallint
# Reverse sign if negative (negative column numbers in sysindexes
# indicate that the column is sorted descending instead of
# ascending)
  if colnum < 0
	then
		let colnum = - colnum
	end if
# Check the HIGH list
	for i = 1 to g_numhi
		if colnum = g_hi_list[i]
		then
			# Column is already in the HIGH list: exit the function
			return
		end if
	end for
# Check the MEDIUM list
	for i = 1 to g_nummed
		if colnum = g_med_list[i]
		then
			# Column is already in the MEDIUM list: exit the function
			return
		end if
	end for
# Column not found: add to list
	let g_nummed = g_nummed + 1
	let g_med_list[g_nummed] = colnum
end function
# add_to_med_list()

##################################################################
function add_to_low_list(colnum)
##################################################################
#
# This function adds a column to the list of columns that are to
# be updated LOW if the column is not already in the list.
#
	define
		colnum		smallint,
		i			smallint
# Reverse sign if negative (negative column numbers in sysindexes
# indicate that the column is sorted descending instead of
# ascending)
	if colnum < 0
	then
		let colnum = - colnum
	end if
# Check the current list
	for i = 1 to g_numlow
		if colnum = g_low_list[i]
		then
			# Column is already in the LOW list: exit the function
			return
		end if
	end for
# Column not found: add to list
	let g_numlow = g_numlow + 1
	let g_low_list[g_numlow] = colnum
end function
# add_to_low_list()

##################################################################
function save_data()
##################################################################
#
# This function saves the info for one table in the temp tables.
# The formulas used to compute costs are based on empirical data.
#
	define
		cost			float,
		i			smallint,
		colno			smallint
	# Start with a zero cost for this table
	let cost = 0
# Insert records for HIGH commands
	for i = 1 to g_numhi
		let colno = g_hi_list[i]
		execute ins_updstat1 using "H", g_tabname, g_colname[colno]
	end for
	let cost = cost + (9e-5 * g_nrows - 9.5925) * g_numhi
# Insert records for MEDIUM commands
	for i = 1 to g_nummed
		let colno = g_med_list[i]
		execute ins_updstat1 using "M", g_tabname, g_colname[colno]
	end for
	let cost = cost + 1e-5 * g_nrows + 27.336
# Insert records for LOW commands
	for i = 1 to g_numlow
		let colno = g_low_list[i]
		execute ins_updstat1 using "L", g_tabname, g_colname[colno]
	end for
	let cost = cost + .0001 * g_nrows + .9661
# Insert cost record
	execute ins_updstat2 using g_tabname, g_dbspace, cost, g_nrows
end function
# save_data()

##################################################################
function write_stmts()
##################################################################
#
# This function reads the info from the temp tables and creates
# the SQL statements.
#
	define
		cost		float
	# For each record in the cost table
	foreach cost_curs
		into g_tabname, g_dbspace, cost, g_nrows
		#Set the output stream
		let g_currstrm = set_stream(g_dbspace, cost)
		#Write the statements
		call write_medium()
		call write_high()
		call write_low()
	end foreach
end function
# write_stmts()

##################################################################
function write_medium()
##################################################################
#
# This function writes the update statistics medium commands for
# all columns in the g_med_list. If the table is large, resolution
# and confidence are specified.
#
	define
		first		smallint,
		colname	char(18),
		level		char(1)
# Start the update statistics statement
	let scratch = "update statistics medium for table ",
			g_tabname clipped,
			" ("
	let first = true
# Append the column names
	let level = "M"
	foreach tbl_col_curs using g_tabname, level
	    into colname
		if first
		then
			let scratch = scratch clipped, colname
			let first = false
		else
			let scratch = scratch clipped,
				", ",
				colname
		end if
	end foreach
	if first
	then
		# No records found
		return
	end if

	let scratch = scratch clipped, ")"
	# Check for "large" tables
	if g_nrows >= 100000
	then
		let scratch = scratch clipped,
				" resolution 1.00 0.99"
	end if
	let scratch = scratch clipped, " distributions only;"
	output to report report1()
end function
# write_medium()
##################################################################
function write_high()
##################################################################
#
# This function writes the update statistics high commands for
# all columns in the g_hi_list.
#
	define
		colname	char(18),
		level		char(1)
# Process the list of columns
	let level = "H"
	foreach tbl_col_curs using g_tabname, level
    into colname
		let scratch = "update statistics high for table ",
					g_tabname
		if colname is null
		then
			# Small table. Do the whole table.
			let scratch = scratch clipped, ";"

		else
			let scratch = scratch clipped,
						" (",
						colname clipped,
						");"
		end if
		output to report report1()
	end foreach
end function
# write_high()
##################################################################
function write_low()
##################################################################
#
# This function writes the update statistics low commands for
# all columns in the g_low_list.
#
	define
		first		smallint,
		colname	char(18),
		level		char(1)
# Start the update statistics statement
	let scratch = "update statistics low for table ",
				g_tabname clipped,
				" ("
	let first = true

# Append the column names
	let level = "L"
	foreach tbl_col_curs using g_tabname, level
		into colname
			if first
			then
				let scratch = scratch clipped, colname
				let first = false
			else
				let scratch = scratch clipped,
							", ",
							colname
			end if
		end foreach
		if first
		then
			# No records found
			return
		end if
		let scratch = scratch clipped, ");"
		output to report report1()
end function
# write_low()

##################################################################
function set_stream(dbspace, cost)
##################################################################
#
# This function sets the global variable g_currstrm with the
# stream or output file number for the current statement. For
# auto_stream, the dbspace provided as a parameter determines the
# stream. For fixed multi-stream case, the stream is selected by
# determining which stream has the least cost so far. The cost
# for the selected stream is updated to include the cost of the
# current table.
#
	define
		dbspace			char(3),
		cost				float, # Cost of this statement
		mincost			float,
		mincost_strm		smallint,
		i			smallint
	# Handle auto-stream
	if g_auto = true
	then
		# See if the dbspace is in the array
		for i = 1 to g_numstrm
			if dbspace = g_stream[i].name
			then
				return i
			end if
		end for
		# Not found: add to the array
		let g_numstrm = g_numstrm + 1
		let g_currstrm = g_numstrm
		let g_stream[g_numstrm].name = dbspace
		return g_numstrm

	end if
	# Handle single stream
	if g_numstrm = 1
	then
		return 1
	end if
	# Handle multi-stream
	let mincost = g_stream[1].cost
	let mincost_strm = 1
	for i = 2 to g_numstrm
		if g_stream[i].cost < mincost
		then
			let mincost = g_stream[i].cost
			let mincost_strm = i
		end if
	end for
	# Update the cost
	let g_stream[mincost_strm].cost = g_stream[mincost_strm].cost + cost
	return mincost_strm
end function
# set_stream()

####################################################################
report report1()
####################################################################
	define
		prepend	char(4),
		len		smallint,
		comma		smallint,
		i		smallint
	output
	    top margin 	0
	    bottom margin	0
	    left margin	0
	    right margin	255
	    page length	1
	    format
	    on every row
			# Build the prepend string
			if g_single
			then
				let prepend = ""
			else
				let prepend = g_currstrm using "<<<|"
			end if
		# Prepend the output file number
		let scratch = prepend clipped,
					scratch
		let len = length(scratch)

		# If the line is longer than 80 bytes, break it into
		# 80-byte lines
		while len > 0
			if len <= 80
			then
				print scratch clipped
				exit while
			else
				# Find break point
				let comma = 0
				for i = 81 to 1 step -1
					if scratch[i] = " "
					then
						print scratch[1, i-1]
						let scratch = prepend clipped,
									  scratch[i+1, len]
						let len = length(scratch)
						continue while
					else
						if scratch[i] = ","
						then
							let comma = i
						end if
					end if
				end for
				# No space found
				if comma > 0
				then
					# Break at a comma
					print scratch[1, comma]
					let scratch = prepend clipped,
								scratch[comma+1, len]
					let len = length(scratch)
					continue while
				else

				# No comma found: break on first space or comma
				for i = 81 to len
					if scratch[i] = " "
					then
						print scratch[1, i-1]
						let scratch = prepend clipped,
									scratch[i+1, len]
						let len = length(scratch)
						continue while
					else
						if scratch[i] = ","
						then
							print scratch[1, i]
							let scratch = prepend clipped,
									scratch[i+1, len]
							let len = length(scratch)
							continue while
						end if
					end if
				end for
			end if
			# Nothing to break on: print the whole line
			print scratch clipped
			exit while
		end if
	end while
end report
# report1()

updstat.awk
This is the awk program that splits the output file from the 4GL program into multiple SQL command files. Each line of the output file has a file number followed by a pipe symbol "|" and then the SQL command. The awk program reads the file number and writes the command to the appropriate file.
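
As an illustration (with hypothetical table names), a few lines of the intermediate file updstat.<dbname>.tmp for a three-stream run might look like this. Note that report1 breaks long commands into 80-byte pieces and re-prepends the stream number on each continuation line, and that the procedures command carries a blank file number, which the awk program routes to the PROC file:

1|update statistics medium for table customer (address1, address2, city,
1|state, zipcode, phone) resolution 1.00 0.99 distributions only;
2|update statistics high for table orders (order_num);
3|update statistics high for table items;
|update statistics for procedure;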



BEGIN \
{
		FS = "|";
}
{
		close(outfile);
		if ($1 == "")
			outfile = sprintf("updstat.%s.PROC.sql", dbname);
		else
		{
			filenum = $1;
			# See if this is a new output file
			found = 0;
			for (i = 1; i <= imax; i++)
				if (filenum == filelist[i])
				{
					found = 1;
					break;
				}
			if (found == 0)
			{
				# Start a new output file
				filelist[++imax] = filenum;
				outfile = sprintf("updstat.%s.%03d.sql", dbname, filenum);
				print "set lock mode to wait;" > outfile;
			}
			else
				outfile = sprintf("updstat.%s.%03d.sql", dbname, filenum);
		}
		# Write the command (or command fragment) to the selected file
		print $2 >> outfile;
}


Conclusion

A final word of caution: if you load a new table (or an entire database), you will want to run UPDATE STATISTICS LOW for the new or refreshed tables before running this program. The reason is that this program determines the number of rows in each table from the column nrows in the system catalog table systables. A new table will have 0 in this column until some form of the UPDATE STATISTICS command is run against the table. If the program finds 0 in this field, it will treat the table as a "small" table and simply run UPDATE STATISTICS HIGH. You probably do not want to run a HIGH command on every table in a database, or on very large tables.
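
A quick way to spot such tables (a sketch, again assuming user tables start at tabid 100) is:

-- Tables whose catalog row count is zero, either because they are truly
-- empty or because statistics have never been run; run UPDATE STATISTICS
-- LOW on each of these before using the utility
select tabname
from systables
where tabid > 99 and tabtype = 'T' and nrows = 0;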

While this is a fairly complex set of programs, I have found that its benefits far outweigh the cost of setting it up. This program produces a very thorough statistical analysis of the database for the optimizer to use, and the multi-stream options execute much more quickly than any other method that uses only a single process. It is also reasonably easy to revise, whether you want to simply change some of the empirical values or change the whole approach to updating statistics (Informix has been known to change their recommendations on this subject). I hope you will find the program as valuable as I have.

Acknowledgment

The original idea for this program came from Jerry Van Zant, who now works for Aggreso, Inc. Jerry provided his version of the program to me several years ago. I have modified it extensively since then, including adding the concept of multiple streams, but the basic idea of extracting table information from the system catalog tables and creating the UPDATE STATISTICS commands programmatically was his.

About the Author

Rick Baker is the Systems and Database Manager for Brightpoint, Inc. in Indianapolis. He has worked with Informix DBMS since 1989. Rick can be reached via email at rick.baker@brightpoint.com.