How to Extract Text from a PDF Document in Java Applications

We can extract all the text from a PDF file in Java applications with Spire.PDF. We can also extract text from a specific page or from a specific area of a PDF file. This article shows how to extract text from a PDF file in Java using the Free Spire.PDF for Java library.

Dependencies

First, we need to add the required dependency so that Free Spire.PDF for Java is available in the Java project. There are two ways to do this. If we use Maven, we add the following code to the project's pom.xml file.

  
          
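<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>2.6.3</version>
    </dependency>
</dependencies>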

For non-Maven projects, download the Free Spire.PDF for Java package from the website and add Spire.Pdf.jar from its lib folder to the project as a dependency.

First, take a look at the sample PDF file. The examples below load it as Sample.pdf.

Extract all text from the whole PDF file. Spire.PDF offers the page.extractText() method for easily extracting all the text from a PDF page.

import com.spire.pdf.*;
import com.spire.pdf.PdfPageBase;
import java.io.*;


public class extractAllTexts {
    public static void main(String[] args)  throws Exception{
        String input = "Sample.pdf";

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(input);

        //Create a new txt file to save the extracted text
        String result = "output/extractAllText.txt";
        File file=new File(result);
        //Remove a leftover file from an earlier run
        if(file.exists()){
            file.delete();
        }
        file.createNewFile();
        FileWriter fw=new FileWriter(file,true);
        BufferedWriter bw=new BufferedWriter(fw); 

        //Extract text from all the pages on the PDF
        PdfPageBase page;
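        //The listing was truncated here; the loop is reconstructed from the text above.
        //getCount() on the page collection is an assumption.
        for(int i=0;i<pdf.getPages().getCount();i++){
            page=pdf.getPages().get(i);
            bw.write(page.extractText());
        }

        bw.flush();
        bw.close();
        fw.close();
    }
}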
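
The file handling in the listing above (File, FileWriter, BufferedWriter, then manual flush and close) can be written more compactly with java.nio.file. The following is a minimal sketch using only the standard library, not Spire.PDF. The class ExtractedTextWriter and its method writePages are hypothetical names, and the placeholder strings stand in for the text returned by page.extractText().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF: saves already-extracted page texts to a file.
class ExtractedTextWriter {

    // Writes each page's text on its own line, creating parent folders
    // and overwriting any file left over from an earlier run.
    static void writePages(Path target, List<String> pageTexts) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no error if the folder already exists
        }
        // With no open options, newBufferedWriter creates or truncates the file;
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String text : pageTexts) {
                writer.write(text);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder strings stand in for the values returned by page.extractText()
        List<String> pages = Arrays.asList("Text of page 1", "Text of page 2");
        writePages(Paths.get("output", "extractAllText.txt"), pages);
    }
}
```

Because the writer truncates an existing file, no separate exists/delete step is needed, and the output folder is created on demand instead of being assumed to exist.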

Extract text from a specific area. We can define a rectangular area on a PDF page and then extract the text inside it with the page.extractText(new Rectangle2D.Float(80, 200, 500, 200)) method.

import com.spire.pdf.*;
import java.awt.geom.Rectangle2D;
import java.io.*;

public class extractTextFromSpecificArea {
    public static void main(String[] args)  throws Exception{

        String input = "Sample.pdf";

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(input);

        //Create a new txt file to save the extracted text
        String result = "output/extractText.txt";
        File file=new File(result);
        //Remove a leftover file from an earlier run
        if(file.exists()){
            file.delete();
        }
        file.createNewFile();
        FileWriter fw=new FileWriter(file,true);
        BufferedWriter bw=new BufferedWriter(fw);

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Extract text from a specific rectangle area within the page
        String text = page.extractText(new Rectangle2D.Float(80, 200, 500, 200));
        bw.write(text);

        bw.flush();
        bw.close();
        fw.close();
    }
}
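
The four arguments of Rectangle2D.Float are x, y, width and height, so the rectangle (80, 200, 500, 200) runs from x = 80 to x = 580 and from y = 200 to y = 400. When the region is known by its corners instead, the rectangle can be built from them. The sketch below uses only java.awt.geom from the standard library. RegionSketch and fromCorners are hypothetical names, and the sketch makes no claim about the units or origin that Spire.PDF uses for page coordinates.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of Spire.PDF: builds the rectangle argument from two corners.
class RegionSketch {

    // Returns the rectangle spanning the two opposite corners (x1, y1) and (x2, y2).
    static Rectangle2D.Float fromCorners(float x1, float y1, float x2, float y2) {
        Rectangle2D.Float region = new Rectangle2D.Float();
        region.setFrameFromDiagonal(x1, y1, x2, y2); // accepts the corners in either order
        return region;
    }

    public static void main(String[] args) {
        // Same region as new Rectangle2D.Float(80, 200, 500, 200)
        Rectangle2D.Float region = fromCorners(80, 200, 580, 400);
        System.out.println("x=" + region.getX() + " y=" + region.getY()
                + " width=" + region.getWidth() + " height=" + region.getHeight());
        System.out.println("far corner: " + region.getMaxX() + ", " + region.getMaxY());
    }
}
```

The returned object can be passed wherever the listing above passes new Rectangle2D.Float(...).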

Extract highlighted text from a PDF file. Some PDF files mark certain text with a highlight color. Spire.PDF offers the page.extractText(textMarkupAnnotation.getBounds()) method for extracting the highlighted text from a PDF.

import com.spire.pdf.*;
import java.io.*;
import com.spire.pdf.annotations.*;
import com.spire.pdf.graphics.*;


public class extractHighlightedText {
    public static void main(String[] args)  throws Exception{

        String input = "Sample.pdf";

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(input);

        //Create a new txt file to save the extracted text
        String result = "output/extractText1.txt";
        File file=new File(result);
        //Remove a leftover file from an earlier run
        if(file.exists()){
            file.delete();
        }
        file.createNewFile();
        FileWriter fw=new FileWriter(file,true);
        BufferedWriter bw=new BufferedWriter(fw);

        bw.write("Extracted highlighted text:");
        PdfPageBase page = pdf.getPages().get(0);

        for (int i = 0; i < page.getAnnotationsWidget().getCount(); i++) {
            if (page.getAnnotationsWidget().get(i) instanceof PdfTextMarkupAnnotationWidget) {
                PdfTextMarkupAnnotationWidget textMarkupAnnotation = (PdfTextMarkupAnnotationWidget) page.getAnnotationsWidget().get(i);
                bw.write(page.extractText(textMarkupAnnotation.getBounds()));
                //Get the highlighted color
                PdfRGBColor color = textMarkupAnnotation.getColor();
                //Write the color as unsigned R,G,B values after the extracted text
                bw.write(" Color (R,G,B): "+(color.getR() & 0XFF)+","+(color.getG() & 0XFF)+","+(color.getB() & 0XFF)+"\n");
            }
        }

        bw.flush();
        bw.close();
        fw.close();
    }
}
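
The & 0XFF masks in the listing above suggest that the color getters return Java byte values, which are signed (-128 to 127). That return type is an assumption drawn from the listing, not from the library documentation. The sketch below uses only the standard library to show why the mask is needed to print the usual 0 to 255 channel values. ColorBytes, toUnsigned and formatRgb are hypothetical names.

```java
// Hypothetical helper, not part of Spire.PDF: formats signed color bytes as 0-255 values.
class ColorBytes {

    // Masking with 0xFF widens the byte to an int and clears the sign-extended bits.
    static int toUnsigned(byte channel) {
        return channel & 0xFF;
    }

    // Joins the three channels as "R,G,B" using unsigned values.
    static String formatRgb(byte r, byte g, byte b) {
        return toUnsigned(r) + "," + toUnsigned(g) + "," + toUnsigned(b);
    }

    public static void main(String[] args) {
        byte full = (byte) 0xFF; // stored as -1 in a signed Java byte
        System.out.println(full);                            // prints -1
        System.out.println(formatRgb(full, full, (byte) 0)); // prints 255,255,0
    }
}
```

Without the mask, a fully saturated channel would be written as -1 instead of 255.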

Original: "https://dev.to/eiceblue/how-to-extract-text-from-pdf-document-in-java-applications-28e3"