
Thread: Webboard: parsing to remove html and to extract filenames




2007-09-11 00:58:41
Author: Alexander Barkov
Email: barmnogosearch.org
Message:
Hi,

> Hi,
> 
> I am running mnogosearch and it works fine in native config.
> 
> I need to do two things which I can't seem to figure out.
> 
> 1.) Many pages from a given URL have the same text on every page
> and a small amount of unique info (e.g. eBay). I want to remove all
> the non-unique text from the stored info and have no idea how to
> do this.

You can either put the
<!--UdmComment-->...<!--/UdmComment-->
tags around the text you don't wish to index, or
use a user defined section, with "expression" and
"replacement".

This example extracts text between the
"<h1>" and "</h1>" tags:

Section h1  29      128 "<h1>(.*)</h1>" $1


See here for more details:

http://www.mnogosearch.org/doc33/msearch-cmdref-section.html
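
The expression/replacement pair from the h1 example above can be
tried outside mnogosearch before it goes into the config. The
following is only a rough sketch in Python: the sample page and the
function name are made up for illustration, "\1" is the Python
spelling of "$1", and Python's regex engine is not guaranteed to
behave exactly like the one the indexer uses.

```python
import re

# Hypothetical page: shared boilerplate around one unique heading.
PAGE = """<html><body>
<div class="nav">Home | Sell | My account</div>
<h1>Vintage camera, boxed</h1>
<div class="footer">Copyright notice repeated on every page</div>
</body></html>"""

# Same expression as in the Section example above.
EXPRESSION = r"<h1>(.*)</h1>"
# Python writes the first captured group as \1 rather than $1.
REPLACEMENT = r"\1"


def extract_section(html: str) -> str:
    """Return the replacement text for the first match, or "" if none."""
    match = re.search(EXPRESSION, html)
    return match.expand(REPLACEMENT) if match else ""


print(extract_section(PAGE))
```

Note that "(.*)" is greedy: if two h1 elements sit on the same line,
the capture runs from the first "<h1>" to the last "</h1>" on that
line. Something like "([^<]*)" is a tighter alternative when the
heading contains no nested tags.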


> 
> 2.) I wish to extract particular filenames from links, e.g. a link
> to myvideo.avi from the URL www.blah.com/myvideo.htm. I have tried
> using the external parsers section to convert from MIME type
> text/html to MIME type text/html and write the result to a file so
> I can see what I am getting, but the file ends up empty. Does anyone
> know how to write everything that mnogosearch sees on a URL to a
> file which can be used by a parser?
> 
> thanks

What does your "Mime" command look like?
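
Independently of the Mime setup, the filename extraction itself can
be prototyped on a saved copy of a page. This is a minimal sketch,
not a ready-made mnogosearch parser: the class and function names,
the extension list and the sample HTML are all hypothetical, and how
the indexer would hand the document to such a script depends on the
Mime command, which is not shown in this thread.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse

# Hypothetical list of extensions to treat as media files.
MEDIA_EXTENSIONS = (".avi", ".mpg", ".mp4")


class LinkFilenames(HTMLParser):
    """Collect the filename part of every <a href> that points at a media file."""

    def __init__(self):
        super().__init__()
        self.filenames = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Keep only the last path component of the link target.
                filename = basename(urlparse(value).path)
                if filename.lower().endswith(MEDIA_EXTENSIONS):
                    self.filenames.append(filename)


def extract_filenames(html: str) -> list:
    """Return media filenames found in the links of an HTML document."""
    parser = LinkFilenames()
    parser.feed(html)
    parser.close()
    return parser.filenames


# Made-up sample modelled on the URL in the question.
SAMPLE = (
    '<a href="http://www.blah.com/files/myvideo.avi">video</a> '
    '<a href="/about.htm">about</a>'
)
print(extract_filenames(SAMPLE))
```

To use it as a filter, the sample string could be swapped for
sys.stdin.read() and the filenames printed one per line.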


Reply: <http://www.mnogosearch.org/board/message.php?id=19585>

