
Thread: x2sh boosting process - revised from the core and up!




2006-02-23 18:17:30
Hello Everybody,

A bit of reviewing, and a _NEW_ post that gets closer to a solution
of "our" "problem".

1. The x2sh library snapshot I posted is completely agnostic
of what it parses, but it is slow when used on a large series
of documents. I have tried this myself: parsing the entire LFS
book and loading it into a semantically significant "table"
simulation using bash arrays takes about 40 - 50 min. This is
because the script code is complex and requires a lot of
counter variables while parsing, but most of all because it
reads everything in the xml source it parses character by
character. This is completely unacceptable under _any_
circumstance. The fact remains that, however "weird" the syntax
in the file may be, by its very nature it would parse literally
anything.

2. The genXML boosting parser script is extremely fast: it
takes only 30 - 40 s (a decrease in time by a factor of _60_
and up). It can seek within the complete set of files for any
element of the form <element> , </element> and dump its output
to a _valid_ xml file with every <element> and its contained
data. Unfortunately it does not support attributes, so
'<screen role="nodump">' "patterns" can escape. It only works
with bash 3.0 and up, because it uses the "perlish" =~ operator,
which is only implemented there. Incorporating attribute
awareness and inline parsing within genXML would eventually
lead to increased complexity, especially when element
attributes are laid out over multiple lines. This would mean
more buffers, more variables, more lockups and eventually a
situation like (1), making no practical use of the advantage
the =~ operator offers.
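
To illustrate the attribute problem, here is a minimal sketch (not
taken from genXML) of how =~ could capture the element name and the
raw attribute string of an opening tag that sits on a single line.
The function name split_open_tag and the variables tag_name and
tag_attrs are invented for this example. Attributes spread over
several lines are not handled, which is exactly the complication
described above.

```shell
#!/bin/bash
# Hypothetical helper, not part of genXML: split a single-line opening
# tag into its element name and its raw attribute string.
# Needs bash 3.0 or later for the =~ operator.
split_open_tag() {
	# Keeping the pattern in a variable avoids quoting differences
	# between bash releases.
	local re='^<([A-Za-z_][A-Za-z0-9_.-]*)([^>]*)>$'
	tag_name=""; tag_attrs=""
	if [[ $1 =~ $re ]]; then
		tag_name=${BASH_REMATCH[1]}
		tag_attrs=${BASH_REMATCH[2]# }	# drop one leading blank
		return 0
	fi
	return 1	# closing tags and plain text do not match
}

if split_open_tag '<screen role="nodump">'; then
	printf 'name=%s attrs=%s\n' "$tag_name" "$tag_attrs"
fi
```

Once the attribute string is isolated like this, a second, simpler
test can decide whether the element should be dumped or skipped.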



******************************************

It appears obvious that, by design, the best approach is to
drop character-by-character parsing (1) and write a very simple
script that reads the XML files without parsing them: for a
_given_ input, each piece of output is either the data contained
within a _pair_ of < and > characters, or it is not. "Parsing"
can then be done on that specific output using =~ and similar
operators, eliminating the need for more complex scripts. I
present to you my newest implementation, which goes in that
direction.

*****************************************
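
The idea described above can be sketched on a single string as
follows. This is a simplified illustration, not the attached script:
the function name split_markup is made up for the example, and it
assumes that every tag opens and closes on the same line.

```shell
#!/bin/bash
# Hypothetical sketch: print each piece of a line on its own line,
# where a piece is either a complete <...> tag or the text between
# two tags. Assumes no tag is split across lines.
split_markup() {
	local rest=$1 text tag
	while [ -n "$rest" ]; do
		case $rest in
			*\<*\>*)
				text=${rest%%<*}	# text before the first '<'
				rest=${rest#*<}		# drop it together with the '<'
				tag="<${rest%%>*}>"	# rebuild the tag up to the first '>'
				rest=${rest#*>}		# continue after the tag
				if [ -n "$text" ]; then
					printf '%s\n' "$text"
				fi
				printf '%s\n' "$tag"
				;;
			*)
				printf '%s\n' "$rest"	# no complete tag left: plain text
				rest=""
				;;
		esac
	done
	return 0
}

split_markup '<para>Hello <emphasis>world</emphasis>!</para>'
```

Only parameter expansion and case patterns are used, so nothing is
read character by character and the sketch does not depend on =~.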


Two versions of this new script are to be posted:

1. Everything is loaded into a conventional bash array where
each entry contains either the data within a <,> pair or the
data outside one. Nothing is printed on screen.
2. Everything is printed on screen while being "parsed" on the
fly; nothing is stored in arrays or anything else.

The reason for the two versions is simple: every printf
builtin call takes some time. Adding an option to print the
resulting array is useless, because displaying it that way
takes more than 4 (four) min, while displaying on the fly takes
nearly 1 min 20 s. These values are under normal operational
load. The version printing on the fly is useful for testing
(making sure that no lines are omitted during output, counter
variables are set right, etc.). So far I think I have worked
out possible pitfalls and bugs, but you never know. This is a
better direction than before: the silent script parses and
loads everything into the array in less than 50 s. I am also
checking out the bash source files to see if there is any
peculiar instruction or coding style that may be efficient for
this kind of scripting.
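
On the cost of printing the array afterwards, one thing that may be
worth measuring (offered as a suggestion, not as a measured result)
is that printf reuses its format string for every remaining
argument, so a whole array can be written with a single builtin call
instead of one call per entry. The array name demo and both function
names below are placeholders.

```shell
#!/bin/bash
# Placeholder data standing in for the parsed array.
declare -a demo=('<para>' 'Hello' '</para>')

# One printf call per entry, as in a plain loop over the indices.
print_each() {
	local i
	for ((i = 0; i < ${#demo[@]}; i++)); do
		printf '%s\n' "${demo[i]}"
	done
}

# A single printf call: the format is reused for each argument.
print_all() {
	printf '%s\n' "${demo[@]}"
}

print_all
```

Running "time print_each >/dev/null" and "time print_all >/dev/null"
on the real array would show whether the difference matters here.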

Note that the way the case constructs are laid out makes bash
treat them as && and || "commands", so we are relieved of the
need to use many 'if' structures coupled with && or ||
operators. This is essential to understanding how the script
works. Also, elements & attributes are preserved in their
entirety and are ready for regexp based filtering within the
array.
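
A minimal sketch of what such filtering could look like, using case
patterns only so that it does not depend on =~. The array name
pieces, its contents and the function name tags_with_attr are
assumptions made for the example; they are not names from the
attached script.

```shell
#!/bin/bash
# Placeholder entries, in the form the array is described as holding:
# one complete tag or one run of text per entry.
declare -a pieces=('<screen role="nodump">' 'make check' '</screen>'
                   '<screen>' 'make install' '</screen>')

# Print the index and content of every opening tag that carries the
# given attribute name. The case patterns do the work of a chain of
# tests joined with && and ||.
tags_with_attr() {
	local attr=$1 i
	for ((i = 0; i < ${#pieces[@]}; i++)); do
		case ${pieces[i]} in
			\</*) ;;	# closing tag: skip
			\<*" $attr="*\>) printf '%d %s\n' "$i" "${pieces[i]}" ;;
			*) ;;		# text or tags without the attribute: skip
		esac
	done
}

tags_with_attr role
```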

Also, for those who want xpointer and related stuff _NOW_ (but
the core must be worked out first!), try this for now on the
version of the script that gives printed output:

./<scriptname>.sh | grep "xpointer"

Xinclude and related issues can be easily solved if the
entirety of the book is parsed and loaded in a semantically and
topographically meaningful manner in less than 2 (two) min.
This is simply a demo; please remember that and bear with me.
Check the attachment for the various versions.

Take note that inline DTD elements and comments within the xml
documents are considered TRASH (of no importance). Check out
the previous x2sh for entity dereferencing (it is very quick
even in the character-by-character parsing edition, and more so
with an approach like this one).
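
For readers who have not seen the previous x2sh, the following is
only a sketch of what entity dereferencing can look like with plain
parameter expansion. It is not the x2sh routine. The entity names,
their values and the function name deref_entities are all invented
for the example.

```shell
#!/bin/bash
# Hypothetical entity table as two parallel arrays (bash 2.x and 3.x
# have no associative arrays). Names and values are made up.
declare -a ent_name=(demo-version demo-dir)
declare -a ent_value=(1.2.3 /tmp/demo)

# Replace every &name; reference in the argument with its value.
# Unknown entities are left untouched.
deref_entities() {
	local text=$1 i
	for ((i = 0; i < ${#ent_name[@]}; i++)); do
		text=${text//"&${ent_name[i]};"/${ent_value[i]}}
	done
	printf '%s\n' "$text"
}

deref_entities 'tar xf demo-&demo-version;.tar.bz2 -C &demo-dir;'
```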


Average execution times (complete LFS 6.1.1 xml source):

1. silent version: ~ 50 s, almost evenly distributed between
user / sys.
2. printing version: ~ 1 min 30 s, distributed in favour of
user vs sys.

Reducing the number of sources and filtering the input types
given to the script leads to almost _factor_ decreases in
execution time.

All under normal operational load (web browsing, various
editors and java proggies running...). Forcing "parsing" of all
the xml files of the book has made it easier for me to debug
some issues regarding counter variables and string manipulation
that can be of use in a more "uninformed" version of the
algorithm as laid out in the script. Thank you for your
patience and understanding. This script will run under both
bash 2.x and 3.x versions.

MD5SUM is 6207d36085782fa45b3fb4f2115f8c67 *makeall.tar.bz2



Thank you for hosting my ideas on your mailing list. Waiting
for your comments and bug reports.

George Makrydakis

gmak


#------------------------cut---------------------------------------------


#!/bin/bash

# x2sh booster - for the x2sh component to the jhalfs project
# author: George Makrydakis > gmakmail a|t gmail d0t c0m <
# license: GPL 2.0 or up
# revision: A1-print-nocomment
# instructions: run in the LFS book root

	declare -a x2SHraw
	declare -a x2SHchapters=(chapter01 \
				chapter02 \
				chapter03 \
				chapter04 \
				chapter05 \
				chapter06 \
				chapter07 \
				chapter08 \
				chapter09);

	declare -i x2SHindex=0
	declare -i lcnt=0

	declare  x2SHfile
	declare  originalsize

	declare otag
	declare ctag
	declare mpnt1
	declare mpnt2
	declare srcvar

	for x2SHpart in "${x2SHchapters[@]}"
	do
		# skip a missing chapter so that the 'cd ..' below stays balanced
		cd "$x2SHpart" || continue
		for x2SHfile in *.xml
		do
			# load the whole file into x2SHraw, one line per entry
			x2SHraw=(); lcnt=0;
			while read x2SHraw[lcnt]
			do
				((lcnt++))
			done <"$x2SHfile"

			for ((lcnt=0; lcnt < ${#x2SHraw[@]}; lcnt++));
			do
				# print the text that precedes the first '<' of the line,
				# or the whole line when it holds no '<' at all
				case ${x2SHraw[lcnt]} in
					'')
					;;
					*)
						case ${x2SHraw[lcnt]} in
							*\<*)
								if [ "${x2SHraw[lcnt]%%<*}" != "" ] ; then
									printf "%s\n" "${x2SHraw[lcnt]%%<*}"
								fi
							;;
							*)
								if [ "${x2SHraw[lcnt]#>}" = "${x2SHraw[lcnt]}" ] ; then
									printf "%s\n" "${x2SHraw[lcnt]}"
								fi
							;;
						esac
					;;
				esac

				mpnt1="${x2SHraw[lcnt]}"
				mpnt2="${x2SHraw[lcnt]}"
				originalsize="${#x2SHraw[lcnt]}"

				# walk the line one <...> pair at a time, until both markers
				# have moved past the last '<' and the last '>'
				until [ "$mpnt1" = "${x2SHraw[lcnt]##*<}" ] && \
				      [ "$mpnt2" = "${x2SHraw[lcnt]##*>}" ] ;
				do
					mpnt1=${mpnt1#*<}; mpnt2=${mpnt2#*>}
					# otag: offset of the current '<'
					# ctag: length from that '<' through the current '>'
					otag=$((originalsize - ${#mpnt1} - 1))
					ctag=$((originalsize - ${#mpnt2} - otag))
					if [ $ctag -ge 0 ] ; then
						printf "%s\n" "${x2SHraw[lcnt]:$otag:$ctag}"
						# text between this tag and the next '<'
						srcvar="$mpnt1"; srcvar="${srcvar#*>}"; srcvar="${srcvar%%<*}"
						case "$srcvar" in
							'')
							;;
							*)
								printf "%s\n" "$srcvar"
								srcvar=""
							;;
						esac
					elif [ $ctag -lt 0 ] ; then
						# the tag is not closed on this line: carry it over
						# to the start of the next line and stop here
						x2SHraw[$((lcnt + 1))]="<""${x2SHraw[lcnt]##*<}"" ${x2SHraw[$((lcnt + 1))]}"
						break
					fi
				done
			done
		done
		cd ..
	done


#---------------------------------cut-------------------------------------------------------------------------

-- 
http://linuxfromscratch.org/mailman/listinfo/alfs-discuss
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page