PDF to plain text, please

BLOB-stored PDF’s can be indexed using Oracle Text using the AUTO_FILTER. This filter can also be applied when just converting the PDF to plain text. You don’t need an index in that case. However, this eats-all-type filter might be missing out on specific PDF documents, even if the PDF version in combination with its security settings are supported. What are the alternatives…

Let me describe the problem more clearly first. I recently had to investigate an issue with converting binary stored PDF files ( BLOB column ) to plain text ( CLOB column ). Essentially the data was modeled like this:

(assume DOCS is the schema owner…)

I had started to implement this business requirement using Oracle Text, specifically using its CTX_DOC.policy_filter procedure. This procedure requires a so called named policy which has to be created in advance (on-time) using CTX_DLL.create_policy. My first approach was to create this policy specifying the previously mentioned AUTO_FILTER.

It worked like a charm…regrettably with the exception of just a few specific PDF documents. I did not receive error messages, but closer inspection of the resulting documents revealed textual properties the end-user would not have accepted. For example, some paragraphs were converted such that each word appeared in a separate line surrounded by three additional empty lines. If AUTO_FILTER would be used to text-index the document, that would not be a problem. For the specific text-based (ML) algorithm that had to read the converted text, it was.

Conversion of PDF documents using AUTO_FILTER works like this. A soon as the RDBMS needs access to the enclosed text stored in the binary document a so-called third-party filter is triggered to convert this document. For Oracle Text, this particular filter is the ctxhx executable living in the $ORACLE_HOME/ctx/bin folder. This executable ( backed-up by a long list of libraries) is very much capable of determining the type of binary document and handling that type appropiately (well mostly). As such, it is part of the Oracle Outside In Technology (OIT). Roger Ford, Oracle Product Manager, has written quite some blog posts on OIT and how that helps on Text-indexing binary documents, see here.

However, as described above, using AUTO_FILTER just wasn’t appropiate enough in the customer’s case. This is where USER_FILTER came to the rescue.

Of course, more or less at this point, I did think about raising an Oracle Support ticket. However, this would involve uploading at least one example of a PDF demonstrating this behaviour. The problem with this had to do with sharing the content of the PDF….

With USER_FILTER an alternative executable can be referenced. This executable has to be placed in the same folder as mentioned above. It must accept two parameters ( filename in, filename out) and of course some file privileges need to be in place in order for the RDBMS to be able to execute it.

I named my executable userfilterpdftotext.sh, revealing that in fact the pdftotext command-line utility is called. This is the content of the script:

#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" -q "$1" "$2"

In other words the script runs quietly outputting UTF-8 encoded text. The -q is mandatory here: any terminal output would be interpreted as if an error had occured during conversion.
Create a preference to be able to use the script above:

begin
  ctx_ddl.create_preference
  ( preference_name => 'PDF_Filter_Preference'
  , object_name     => 'USER_FILTER'
  );
  ctx_ddl.set_attribute
  ( preference_name => 'PDF_Filter_Preference'
  , attribute_name  => 'COMMAND'
  , attribute_value => 'userfilterpdftotext.sh'
  );
  ctx_ddl.create_policy
  ( policy_name     => 'PDF_Filter_Policy'
  , filter          => 'PDF_Filter_Preference'
  );
end;
/

After which, essentially ( bulk-fetch and error handling had been used in the actual implementation) the following script was run to convert the binary pdf documents:

declare
  cursor c_doc
  is
  select  doc.binary_pdf
  ,       doc.plain_text
  from    pdf_documents doc
  for update of doc.plain_text;
  r_doc c_doc%rowtype;
begin
  open c_doc;
  fetch c_doc into r_doc;
  while c_doc%found
  loop
    ctx_doc.policy_filter
    ( policy_name => 'PDF_Filter_Policy'
    , document    => r_doc.binary_pdf
    , restab      => r_doc.plain_text
    , plaintext   => TRUE
    , format      => 'BINARY'
    );
    update pdf_documents doc
    set    doc.plain_text = r_doc.plain_text
    where  current of c_doc;    
    fetch c_doc into r_doc;    
  end loop;
  commit;
end;
/

The script above has been written to a package named “EXTRACTION”. The following SQL*Plus demonstrates the result. Note that for both LOB columns, just a small portion of the is shown here.

SQL> select * from pdf_documents
  2  /

BINARY_PDF                     PLAIN_TEXT                              
------------------------------ ----------------------------------
255044462D312E340A25E2E3CFD30D                                         
255044462D312E340A25E2E3CFD30D                                         
255044462D312E340A25E2E3CFD30D                                         

SQL> call extraction.pdftotext()
  2  /

Call completed.

SQL> select * from pdf_documents
  2  /

BINARY_PDF                     PLAIN_TEXT                              
------------------------------ ----------------------------------
255044462D312E340A25E2E3CFD30D Oracle® Database                        
                               Database PL/S                           

255044462D312E340A25E2E3CFD30D Oracle® Database                        
                               SQL Language                            

255044462D312E340A25E2E3CFD30D Oracle® Text                            
                               Reference                               
                                                                       
                               19c                                     
                               E9                                      


Meer informatie

rob-van-den-berg1

Rob van den Berg

Oracle Database Consultant

+31 6 22 98 69 27 Stuur Rob een e-mail

Reacties

Er zijn nog geen reacties op dit bericht.

Plaats een reactie

Dit veld is verplicht.

Vul een geldig e-mailadres in.

Dit veld is verplicht.