How to extract South Korean patent application number from PDF

***Note: This approach will only work if the patent PDF contains text and is not a scanned image. If you can select the text in your PDF reader, it’s likely a suitable PDF patent.***Note that Espacenet downloads PDF patents that do not contain text.

South Korean patents like this example list an application number that is listed on the front page (marked in red):


In this example, the application number is 10-2019-0094876.

In order to automatically extract the number, we can use pdftotext together with the ubiquitous Linux tools grep and tail.

First, download the original PDF (e.g. from Google Patents). In this example, the file is named KR20190098928A_Original_document_20200123004431.pdf.

Now run pdftotext on this file:

pdftotext KR20190098928A.pdf

This will produce a text file named KR20190098928A.txt containing all the text from the original PDF.

Now we can grep for 출원번호 which is the Korean term for application number, together with (21), which is the column number for the application number. Just seeing boxes? Don’t worry, your computer knows what it means, you just don’t have a south korean font installed - just ignore it.

Now we can filter out only the information we want:

grep --after=1 "(21) 출원번호" KR20190098928A.txt | tail -n 1

In our example, this will print 10-2019-0094876.

How does it work?

The relevant section in KR20190098928A.txt looks like this:

(21) 출원번호

We basically grep for the content of the first line and tell grep to print one line after the match (--after 1). This will print both the matching line and the line containing the application number. Now we can use tail -n 1 to print just the last line (-n 1) from the output.

Need any other information from the patent metadata? Often you can use a similar approach and just modify the grep statement. In some cases, consider using -layout or -raw as option to pdftotext.

Need professional software engineering services when automatically extracting data from your PDFs? Checkout TechOverflow consulting.