Thread: Extracting data from a postscript file




Extracting data from a postscript file
user name
2006-08-31 21:11:59
Rick wrote:
> Hi,
>
> I am having an issue capturing all the data I need in a particular
> postscript format.
>
> The data information is broken down as follows:
>
> The co-ordinates for the data xxxx (space) yyyy (space) the letter 'M'
> (space) and the data is in the brackets.
>
> The next character after the closing bracket can be either an opening
> square bracket ([), an 'S' or an end of line (this determines that this
> is valid data).
>
> There are 3 different data format scenarios.
>
> 1. 400 767 M (Data I need to extract) -> suffix characters [ or S or
> eol
> 2. 400 767 M (data \\( I need to extract \\) with nested brackets)S
>     The nested brackets will always be escaped with '\\'
> 3. 175 3303 M (t)S  188 3303 M (t)S  202 3303 M (p)S etc  which is all
> on one line
>
> The coordinates may not be the first characters on the line.
>
> I have had some success with
> [0-9]{1,4}\s[0-9]{1,4}\s[M]\s\(.*[^\\](?=\)(\z|S|\[)) but this doesn't
> handle data format 3 as it extracts the whole line rather than breaking
> it down to each data section.
>
> The 2 critical issues I need to address are:
> -Handling data format 3, which also includes nested brackets
> -Ensuring that other unwanted data is not captured that may be in a
> similar format but will not contain one of the 3 suffix characters
> mentioned above.
>
> I am new to regular expressions and this is doing my head in so your
> time is appreciated if you can help me.

If you want to separate every data item satisfying your conditions into
different records, then you may try:

 
(\d+\s+\d+\s+M\s+\([^\\()]*(?:[\\][\\].[^\\()]*)*\)(?:S|$|\[))

which means:

(                    # start of capturing $1
  \d+\s+\d+          # two numbers separated by spaces
  \s+M\s+            # followed by a capital letter 'M'
  \(                 # opening paren
    [^\\()]*         # anything except parens and backslash
    (?:              # start of grouping
      [\\][\\].      # any escaped character, in the form \\.
      [^\\()]*       # anything except parens and backslash
    )*               # end of grouping, any number of such groups
  \)                 # closing paren
  (?:S|$|\[)         # followed by 'S', eol, or '['
)                    # end of capturing $1

For the following input string (use a global search):
_________________________
1. 400 767 M (Data I need to extract) -> suffix characters [ or S or
eol
2. 400 767 M (data \\( I need to extract \\) with nested brackets)S
    The nested brackets will always be escaped with '\\'
3. 175 3303 M (t)S  188 3303 M (t)S  202 3303 M (p)S etc  which is all
on one line
__________________________

it prints 4 records (line 1 does not have a wanted suffix):
___________________________
400 767 M (data \\( I need to extract \\) with nested brackets)S
175 3303 M (t)S
188 3303 M (t)S
202 3303 M (p)S
____________________
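
For anyone who wants to reproduce the 4 records above, here is a minimal sketch in Python (my choice of language; the thread does not name a tool). It assumes the sample is stored exactly as shown, with each escape written as two backslashes, and it uses re.MULTILINE so that $ matches at the end of every line rather than only at the end of the input. The names RECORD and SAMPLE are made up for the illustration.

```python
import re

# The pattern from the post, laid out with re.VERBOSE.
# re.MULTILINE is an assumption: it lets $ match at each end of line.
RECORD = re.compile(
    r"""
    \d+\s+\d+        # two numbers separated by whitespace
    \s+M\s+          # the letter M
    \(               # opening paren
    [^\\()]*         # anything except parens and backslash
    (?:
        [\\][\\].    # escape: two backslashes plus any character
        [^\\()]*     # anything except parens and backslash
    )*
    \)               # closing paren
    (?:S|$|\[)       # suffix: 'S', end of line, or '['
    """,
    re.VERBOSE | re.MULTILINE,
)

SAMPLE = r"""1. 400 767 M (Data I need to extract) -> suffix characters [ or S or
eol
2. 400 767 M (data \\( I need to extract \\) with nested brackets)S
    The nested brackets will always be escaped with '\\'
3. 175 3303 M (t)S  188 3303 M (t)S  202 3303 M (p)S etc  which is all
on one line
"""

# findall performs the global search and returns every full match.
records = RECORD.findall(SAMPLE)
for record in records:
    print(record)
```

Line 1 is skipped because the character after its closing paren is a space, which is not one of the three accepted suffixes.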
But if you want only 2 records, with the last three held in one, then
just make a minor modification, like:

((?:\d+\s+\d+\s+M\s+\([^\\()]*(?:[\\][\\].[^\\()]*)*\)(?:S|$|\[)\x20*)+)

(contents captured in $1) This uses an outer grouping paren and optional
spaces \x20* to join adjacent records of the same structure on one
line. You may also want to change \x20* to [ \t]*, which also accepts
TAB as a connector.
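
A short sketch of the joined variant, again in Python and again only an illustration: ITEM, GROUPED and LINE are hypothetical names, and the [ \t]* connector is the tab-friendly alternative mentioned above.

```python
import re

# One record, exactly as in the first pattern.
ITEM = r"\d+\s+\d+\s+M\s+\([^\\()]*(?:[\\][\\].[^\\()]*)*\)(?:S|$|\[)"

# One or more records on the same line, joined by spaces or tabs.
GROUPED = re.compile(r"(?:" + ITEM + r"[ \t]*)+", re.MULTILINE)

LINE = "3. 175 3303 M (t)S  188 3303 M (t)S  202 3303 M (p)S etc  which is all"

# Each match is one run of adjacent records.
runs = [m.group(0) for m in GROUPED.finditer(LINE)]
print(runs)
```

Note that the connector also swallows the blanks after the last record, so you will probably want to strip trailing whitespace from each run.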

BTW, you may need to take care of backslash escaping: some tools
need more than two backslashes to represent one backslash. Here I just
used a character class [\\] to specify one backslash.
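
To illustrate the backslash layers, here is how it looks in Python (just one example of a host language; other tools have their own quoting rules). A raw string hands the pattern to the regex engine untouched, while an ordinary string literal needs every backslash doubled once more.

```python
import re

raw_pattern = r"[\\]"       # raw string: 4 characters, [ \ \ ]
plain_pattern = "[\\\\]"    # ordinary string: the same 4 characters

text = "a\\b"               # 3 characters: a, one backslash, b

# Both spellings are the same pattern and match a single backslash.
print(raw_pattern == plain_pattern)
print(re.findall(raw_pattern, text))
```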

Good luck,
Xicheng


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Regex" group.
To post to this group, send email to regexgooglegroups.com
To unsubscribe from this group, send email to regex-unsubscribegooglegroups.com
For more options, visit this group at http://groups.google.com/group/regex
-~----------~----~----~----~------~----~------~--~---

Extracting data from a postscript file
user name
2006-09-08 02:23:09

Thanks Xicheng, your comments were helpful in understanding regex a bit
more, but I managed to solve my issue before this post was published (a
long time!). My resolution was
([0-9]{1,4}\s[0-9]{1,4})\s[M]\s\(((?>.*?[^\\]\)))(\r|\z|S|\[)
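
For readers trying this pattern outside a flavor that has atomic groups and \z, here is a rough Python adaptation. It is a sketch under two assumptions that are not part of the original pattern: the atomic group (?>...) is emulated with a lookahead plus a backreference, (?=(...))\2, and \z is replaced by Python's \Z. RICK and LINE are names made up for the example.

```python
import re

# Adapted pattern (assumptions, not the original spelling):
#   (?>...)  ->  (?=(...))\2   lookahead + backreference acts atomically
#   \z       ->  \Z            end-of-string anchor in Python's re
RICK = re.compile(
    r"([0-9]{1,4}\s[0-9]{1,4})\s[M]\s\((?=(.*?[^\\]\)))\2(\r|\Z|S|\[)"
)

LINE = r"175 3303 M (t)S  400 767 M (data \\( kept \\) whole)S"

# Group 1: coordinates, group 2: data plus closing paren, group 3: suffix.
matches = RICK.findall(LINE)
for coords, data, suffix in matches:
    print(coords, "|", data[:-1], "|", repr(suffix))

# The data stops at the first unescaped closing paren and cannot be
# stretched to a later one, so a wrong suffix rejects the whole record.
rejected = RICK.findall("1 2 M (a) x (b)S")
```

The captured data includes the closing paren, which is why the loop drops the last character before printing.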

Thanks again
Rick


