Thread: note 74211 added to ref.pdf

user name
2007-03-30 00:09:17
I am trying to extract the text from PDF files and use it to
feed a search engine (Intranet tool). I tried several of the
"PDF2TXT" functions posted below, but they do not produce
the expected result. At the least, all words need to be
separated by spaces (so they can be used as keywords), and
the "junk" codes removed (for example: binary data,
pictures...). I started modifying the interesting function
posted by Sven, and here is my current version, which is
starting to work quite well (with PDF version 1.2). Sorry
for having a quite different style of programming. Luc
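
The keyword requirement described above (words separated by
spaces, "junk" removed) could be sketched roughly as follows.
This is only an illustration and not part of the posted patch:
the helper name textToKeywords() is hypothetical, and it assumes
that every byte outside printable ASCII counts as junk, which
would also drop accented characters.

```php
<?php
// Hypothetical helper (not part of the original pdf2txt code):
// turn raw extracted text into a list of keywords.
function textToKeywords($text)
{
    // Assumption: anything outside printable ASCII is "junk";
    // replace each run of such bytes with a single space.
    $clean = preg_replace('/[^\x20-\x7E]+/', ' ', $text);

    // Split on runs of whitespace and drop empty entries.
    return preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
}

// Example call: control bytes, tabs and line breaks all end up
// as word separators.
print_r(textToKeywords("Hello\x00\x01World  from\tPDF\n"));
```

A real search index would probably also lower-case the words
and drop very short ones, but that depends on the search tool.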

<?php
// Patch for pdf2txt() posted by Sven Schuberth
// Add/replace the following code (cannot post the full program
// because of the size limitation)

// Handles version 1.2
// New version of handleV2($data), only one line changed
function handleV2($data){

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");

    // initialise the variables used below
    $j = 0;
    $a_chunks = array();
    $result_data = "";

    foreach($a_obj as $obj){

        $a_filter = getDataArray($obj,"<<",">>");

        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                // strip the leading "stream\r\n" and the trailing "endstream"
                $a_chunks[$j]["data"] = substr($a_data[0],
                    strlen("stream\r\n"),
                    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it
        // by looking at the contents of the filter
        $a_filter = explode("/",$chunk["filter"]);

        if (isset($chunk["data"]) && $chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            // (strpos searches for the substring; substr does not)
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                // @ hides the warning raised when the data is not valid zlib data
                $data = @gzuncompress($chunk["data"]);
                if ($data !== false && trim($data)!=""){
                    // CHANGED HERE, before: $result_data .= ps2txt($data);
                    $result_data .= PS2Text_New($data);
                } else {

                    //$result_data .= "x";
                }
            }
        }
    }
    return $result_data;
}

// New function - Extract text from PS codes
function ExtractPSTextElement($SourceString)
{
$Result = '';
$CurStartPos = 0;
while (($CurStartText = strpos($SourceString, '(', $CurStartPos)) !== FALSE)
	{
	// New text element found
	if ($CurStartText - $CurStartPos > 8) $Spacing = ' ';
	else	{
		// Short gap: read it as a number; values below -25 are treated as a word space
		$SpacingSize = (float) substr($SourceString, $CurStartPos, $CurStartText - $CurStartPos);
		if ($SpacingSize < -25) $Spacing = ' '; else $Spacing = '';
		}
	$CurStartText++;

	// Find the closing ')', skipping escaped ones ('\)')
	$StartSearchEnd = $CurStartText;
	while (($CurStartPos = strpos($SourceString, ')', $StartSearchEnd)) !== FALSE)
		{
		if (substr($SourceString, $CurStartPos - 1, 1) != '\\') break;
		$StartSearchEnd = $CurStartPos + 1;
		}
	if ($CurStartPos === FALSE) break; // something wrong happened

	// Remove ending '-'
	if (substr($Result, -1, 1) == '-')
		{
		$Spacing = '';
		$Result = substr($Result, 0, -1);
		}

	// Add to result
	$Result .= $Spacing . substr($SourceString, $CurStartText, $CurStartPos - $CurStartText);
	$CurStartPos++;
	}
// Add line breaks (otherwise, result is one big line...)
return $Result . "\n";
}

// Global table for codes replacement (unescapes '\(' and '\)')
$TCodeReplace = array ('\\(' => '(', '\\)' => ')');

// New function, replacing old "ps2txt" function
function PS2Text_New($PS_Data)
{
global $TCodeReplace;

// Skip data that does not look like text (control byte first, CID init)
if (ord($PS_Data[0]) < 10) return '';
if (substr($PS_Data, 0, 8) == '/CIDInit') return '';

// Some text inside (...) can be found outside the [...] sets,
// and would then be ignored
// => disabling the processing of [...] is the easiest solution

$Result = ExtractPSTextElement($PS_Data);

// echo "Code=$PS_Data\nRES=$Result\n\n";

// Remove/translate some codes
return strtr($Result, $TCodeReplace);
}

?>
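
To illustrate what the extraction step does with the (...)
string literals and the escaped parentheses, here is a small
standalone sketch. It is not the patch above: it swaps the
strpos() scanning for a regular expression, the function name
extractLiterals() is hypothetical, and it assumes an already
uncompressed stream without nested unescaped parentheses.

```php
<?php
// Hypothetical sketch, not the author's code: collect the (...)
// string literals from an uncompressed content stream.
function extractLiterals($stream)
{
    // A literal is "(" followed by any mix of escaped characters
    // (backslash + one character) or characters other than
    // backslash and parentheses, followed by ")".
    preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $stream, $m);

    $out = array();
    foreach ($m[1] as $literal) {
        // Unescape \( \) and \\ ; other escapes are left untouched.
        $out[] = strtr($literal, array(
            '\\(' => '(',
            '\\)' => ')',
            '\\\\' => '\\',
        ));
    }
    return $out;
}

// Example stream with one plain and one escaped literal.
$stream = 'BT (Hello) Tj -30 (World \\(ok\\)) Tj ET';
print_r(extractLiterals($stream));
```

The regular expression keeps the escape handling in one place,
but it does not reproduce the spacing heuristics (the gap length
and the -25 threshold) that the patch relies on.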
----
Server IP: 69.147.83.197
Probable Submitter: 61.7.174.57 (proxied: 61.7.174.57,
61.7.174.57)
----
Manual Page -- http://www.php.net/manual/en/ref.pdf.php

-- 
PHP Notes Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

