BLOB-stored PDF’s can be indexed using Oracle Text using the AUTO_FILTER. This filter can also be applied when just converting the PDF to plain text. You don’t need an index in that case. However, this eats-all-type filter might be missing out on specific PDF documents, even if the PDF version in combination with its security settings are supported. What are the alternatives…
Let me describe the problem more clearly first. I recently had to investigate an issue with converting binary stored PDF files ( BLOB column ) to plain text ( CLOB column ). Essentially the data was modeled like this:
I had started to implement this business requirement using Oracle Text, specifically using its CTX_DOC.policy_filter procedure. This procedure requires a so called named policy which has to be created in advance (on-time) using CTX_DLL.create_policy. My first approach was to create this policy specifying the previously mentioned AUTO_FILTER.
It worked like a charm…regrettably with the exception of just a few specific PDF documents. I did not receive error messages, but closer inspection of the resulting documents revealed textual properties the end-user would not have accepted. For example, some paragraphs were converted such that each word appeared in a separate line surrounded by three additional empty lines. If AUTO_FILTER would be used to text-index the document, that would not be a problem. For the specific text-based (ML) algorithm that had to read the converted text, it was.
Conversion of PDF documents using AUTO_FILTER works like this. A soon as the RDBMS needs access to the enclosed text stored in the binary document a so-called third-party filter is triggered to convert this document. For Oracle Text, this particular filter is the ctxhx executable living in the $ORACLE_HOME/ctx/bin folder. This executable ( backed-up by a long list of libraries) is very much capable of determining the type of binary document and handling that type appropiately (well mostly). As such, it is part of the Oracle Outside In Technology (OIT). Roger Ford, Oracle Product Manager, has written quite some blog posts on OIT and how that helps on Text-indexing binary documents, see here.
However, as described above, using AUTO_FILTER just wasn’t appropiate enough in the customer’s case. This is where USER_FILTER came to the rescue.
Of course, more or less at this point, I did think about raising an Oracle Support ticket. However, this would involve uploading at least one example of a PDF demonstrating this behaviour. The problem with this had to do with sharing the content of the PDF….
With USER_FILTER an alternative executable can be referenced. This executable has to be placed in the same folder as mentioned above. It must accept two parameters ( filename in, filename out) and of course some file privileges need to be in place in order for the RDBMS to be able to execute it.
I named my executable userfilterpdftotext.sh, revealing that in fact the pdftotext command-line utility is called. This is the content of the script:
#!/bin/sh /usr/bin/pdftotext -enc "UTF-8" -q "$1" "$2"
In other words the script runs quietly outputting UTF-8 encoded text. The -q is mandatory here: any terminal output would be interpreted as if an error had occured during conversion.
Create a preference to be able to use the script above:
begin ctx_ddl.create_preference ( preference_name => 'PDF_Filter_Preference' , object_name => 'USER_FILTER' ); ctx_ddl.set_attribute ( preference_name => 'PDF_Filter_Preference' , attribute_name => 'COMMAND' , attribute_value => 'userfilterpdftotext.sh' ); ctx_ddl.create_policy ( policy_name => 'PDF_Filter_Policy' , filter => 'PDF_Filter_Preference' ); end; /
After which, essentially ( bulk-fetch and error handling had been used in the actual implementation) the following script was run to convert the binary pdf documents:
declare cursor c_doc is select doc.binary_pdf , doc.plain_text from pdf_documents doc for update of doc.plain_text; r_doc c_doc%rowtype; begin open c_doc; fetch c_doc into r_doc; while c_doc%found loop ctx_doc.policy_filter ( policy_name => 'PDF_Filter_Policy' , document => r_doc.binary_pdf , restab => r_doc.plain_text , plaintext => TRUE , format => 'BINARY' ); update pdf_documents doc set doc.plain_text = r_doc.plain_text where current of c_doc; fetch c_doc into r_doc; end loop; commit; end; /
The script above has been written to a package named “EXTRACTION”. The following SQL*Plus demonstrates the result. Note that for both LOB columns, just a small portion of the is shown here.
SQL> select * from pdf_documents 2 / BINARY_PDF PLAIN_TEXT ------------------------------ ---------------------------------- 255044462D312E340A25E2E3CFD30D 255044462D312E340A25E2E3CFD30D 255044462D312E340A25E2E3CFD30D SQL> call extraction.pdftotext() 2 / Call completed. SQL> select * from pdf_documents 2 / BINARY_PDF PLAIN_TEXT ------------------------------ ---------------------------------- 255044462D312E340A25E2E3CFD30D Oracle® Database Database PL/S 255044462D312E340A25E2E3CFD30D Oracle® Database SQL Language 255044462D312E340A25E2E3CFD30D Oracle® Text Reference 19c E9