Extract PDF's Text

Extract text from PDF documents using the `pdf-text` endpoint.

This tutorial demonstrates how to extract text from a PDF document using the pdf-text endpoint.

First we use cURL to call the endpoint directly as a REST call. We then use the DynamicPDF API client libraries to call the endpoint programmatically.

Required Resources

To complete this tutorial, you must add the Extract Text (pdf-text endpoint) sample to your samples folder in your cloud storage space using the File Manager. After adding the sample resources, you should see a samples/extract-text-pdf-text-endpoint folder containing the resources for this tutorial.

Sample	Sample Folder	Resources
Extract Text	`samples/extract-text-pdf-text-endpoint`	`fw4.pdf`

From the File Manager, download fw4.pdf to your local system; here we assume /temp/dynamicpdf-api-samples/extract-text.
After downloading, delete fw4.pdf from your cloud storage space using the File Manager.

Resource	Cloud/Local
`fw4.pdf`	local

tip

See Sample Resources for instructions on adding sample resources.

Obtaining API Key

This tutorial assumes a valid API key obtained from the DynamicPDF API's Portal. Refer to the following for instructions on getting an API key.

Apps and API Keys

tip

If you are not familiar with the File Manager or Apps and API Keys, refer to the following tutorial and relevant Users Guide pages.

Calling API Directly Using POST

The pdf-text endpoint takes a POST request. When using cURL, you specify the endpoint, the HTTP method, the API key, and the local resources required. The pdf-text endpoint also accepts the starting page and page count as optional query parameters.

Let's extract the text of only the first two pages of the PDF. To do so, in addition to sending the PDF and API key in the request, we must also send two query string parameters, startPage and pageCount.

Figure 1. To extract the first two pages of this PDF, set the start page and the number of pages.

Parameter	Parameter Type	Value
`startPage`	Query	1
`pageCount`	Query	2

info

Setting both startPage and pageCount to zero (or omitting the query string parameters) returns the text of all pages of the PDF.
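
As a rough illustration of how the two query parameters combine into the request URL, the Python sketch below builds the URL with the standard library. The helper name `build_pdf_text_url` is hypothetical and not part of any DynamicPDF library, and the sketch only covers the two cases described above: both parameters set, or both left at zero.

```python
from urllib.parse import urlencode

# Endpoint URL as used in this tutorial's cURL command.
PDF_TEXT_URL = "https://api.dpdf.io/v1.0/pdf-text"


def build_pdf_text_url(start_page=0, page_count=0):
    """Hypothetical helper: append startPage/pageCount as query parameters.

    When both values are zero the parameters are omitted, which the
    tutorial describes as returning all pages.
    """
    if start_page < 0 or page_count < 0:
        raise ValueError("start_page and page_count must not be negative")
    if start_page == 0 and page_count == 0:
        return PDF_TEXT_URL
    query = urlencode({"startPage": start_page, "pageCount": page_count})
    return f"{PDF_TEXT_URL}?{query}"


print(build_pdf_text_url(1, 2))  # first two pages
print(build_pdf_text_url())      # all pages
```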

Make Request Using API

Create the following cURL command, where the PDF is sent to the endpoint as binary data.
Add startPage and pageCount as query string parameters to the request URL.
Specify the Content-Type as application/pdf so the endpoint knows the binary data is a PDF.

curl -X POST "https://api.dpdf.io/v1.0/pdf-text?startPage=1&pageCount=2" \
  -H "Content-Type: application/pdf" \
  -H "Authorization: Bearer DP.xxx-api-key-xxx" \
  --data-binary "@c:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf"

Execute the command, and the extracted text is returned to the command line as a JSON document.

[
    {
        "pageNumber": 1,
        "text": "[DynamicPDF Evaluation] Form  W-4\n(Rev. December 2020)\nDepartment of the Treasury  \nInternal Revenue Service \nEmployee’s Withholding Certificate\n ▶  Complete Form W-4 so that your employer can withhold the correct federal  income tax from your pay. \n ▶  Give Form W-4 to your empl ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
    },
    {
        "pageNumber": 2,
        "text": "[DynamicPDF Evaluation] Form W-4 (2021) Page 2\nGeneral Instructions\nFuture Developments\nFor the latest information about developments related to \nForm W-4, such as legislation enacted after it was published, \ngo to www.irs.gov/FormW4 .\nPurpose of Form\nComplete Form W-4 so that yo ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
    }
]
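
For readers who prefer a script to the command line, the same request can be sketched in Python using only the standard library. This is a minimal sketch mirroring the cURL command above, not an official client; the function names `build_request` and `extract_text` are hypothetical, and the response is assumed to have the JSON shape shown above (a list of objects with `pageNumber` and `text`).

```python
import json
import urllib.request


def build_request(pdf_bytes, api_key, start_page=1, page_count=2):
    """Build (but do not send) a POST request mirroring the cURL command."""
    url = (
        "https://api.dpdf.io/v1.0/pdf-text"
        f"?startPage={start_page}&pageCount={page_count}"
    )
    return urllib.request.Request(
        url,
        data=pdf_bytes,  # raw PDF bytes, like cURL's --data-binary
        method="POST",
        headers={
            "Content-Type": "application/pdf",
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_text(pdf_path, api_key):
    """Send the PDF and return the parsed JSON (a list of page dicts)."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_request(pdf_file.read(), api_key)
    # urlopen raises urllib.error.HTTPError on a non-2xx status.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_text("C:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf", "DP.xxx-api-key-xxx")` with a real API key would send the request; `build_request` on its own makes no network call, which makes it convenient for checking the URL and headers first.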

Calling Endpoint Using Client Library

To simplify development, you can also use one of the DynamicPDF API client libraries. Use the client library of your choice to complete this tutorial section.

Complete Source

You can access the complete source for this project at one of the following GitHub projects.

Language	File Name	Location (package/namespace/etc.)	GitHub Project
Java	`ExtractText.java`	`com.dynamicpdf.api.examples`	https://github.com/dynamicpdf-api/java-client-examples
C#	`Program.cs`	`ExtractText`	https://github.com/dynamicpdf-api/dotnet-client-examples
Nodejs	`ExtractText.js`	`nodejs-client-examples`	https://github.com/dynamicpdf-api/nodejs-client-examples
PHP	`ExtractText.php`	`php-client-examples`	https://github.com/dynamicpdf-api/php-client-examples
GO	`pdf-text-example.go`	`go-client-examples`	https://github.com/dynamicpdf-api/go-client-examples/tree/main
Python	`PdfTextExample.py`	`python-client-examples`	https://github.com/dynamicpdf-api/python-client-examples

tip

Click on the language tab of choice to view the tutorial steps for the particular language.

Available on NuGet:

Install-Package DynamicPDF.API

Create a new Console App (.NET Core) project named ExtractText.
Add the DynamicPDF.API NuGet package.
Create a new static method named Run.
Add the following code to the Run method.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance and load the PdfResource instance in the constructor.
Set the PdfText instance's StartPage to one and PageCount to two.
Add a call to the PdfText instance's Process method to call the pdf-text endpoint.
If successful, print the text as JSON to the console.

using DynamicPDF.Api;
using System;

namespace ExtractText
{
    class Program
    {
        static void Main(string[] args)
        {
            Run("DP.xxx-api-key-xxx", "C:/temp/dynamicpdf-api-samples/extract-text/");
        }

        public static void Run(String apiKey, String basePath)
        {
            PdfResource resource = new PdfResource(basePath + "fw4.pdf");
            PdfText pdfText = new PdfText(resource);
            pdfText.ApiKey = apiKey;
            pdfText.StartPage = 1;
            pdfText.PageCount = 2;
            PdfTextResponse response = pdfText.Process();
            if (response.IsSuccessful)
            {
                Console.WriteLine(response.JsonContent);
            } else
            {
                Console.WriteLine(response.ErrorJson);
            }
        }
    }
}

Available on NPM:

npm i @dynamicpdf/api

Use npm to install the DynamicPDF API module.
Create a new class named ExtractText.
Create a static Run method.
Add the following code to the Run method.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance and load the PdfResource instance in the constructor.
Set the PdfText instance's startPage to one and pageCount to two.
Add a call to the PdfText instance's process method to call the pdf-text endpoint.
If successful, print the text as JSON to the console.
Add a call to the ExtractText.Run method.

import { 
    PdfResource,
    PdfText 
} from "@dynamicpdf/api"

export class ExtractText {
    
    static async Run() {
        var resource = new PdfResource("C:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf");
        var pdfText = new PdfText(resource);
        pdfText.apiKey = "DP.xxx-api-key-xxx";
        pdfText.startPage = 1;
        pdfText.pageCount = 2;
        var res = await pdfText.process();
        if (res.isSuccessful) {
            console.log(JSON.parse(res.content));
        } else {
            console.log(res.errorJson);
        }
    }
}
await ExtractText.Run();

Run the application with `node ExtractText.js`, and the JSON is output to the console.

Available on Maven:

https://search.maven.org/search?q=g:com.dynamicpdf.api

<dependency>
  <groupId>com.dynamicpdf.api</groupId>
  <artifactId>dynamicpdf-api</artifactId>
  <version>1.0.0</version>
</dependency>

Create a new Maven project and add the DynamicPDF API as a dependency.
Create a new class named ExtractText with a main method.
Create a new method named Run.
Add the Run method call to main.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance and load the PdfResource instance in the constructor.
Set the PdfText instance's start page to one and page count to two.
Add a call to the PdfText instance's process method to call the pdf-text endpoint.
If successful, print the text as JSON to the console.
Run the application, and the extracted text is written to the console as JSON.

package com.dynamicpdf.api.examples;

import com.dynamicpdf.api.PdfResource;
import com.dynamicpdf.api.PdfText;
import com.dynamicpdf.api.PdfTextResponse;

public class ExtractText {

    public static void main(String[] args) {
        ExtractText.Run("DP.xxx-api-key-xxx",
                "C:/temp/dynamicpdf-api-samples/extract-text/");

    }

    public static void Run(String apiKey, String basePath) {
        
        PdfResource resource = new PdfResource(basePath + "fw4.pdf");
        PdfText pdfText = new PdfText(resource);
        pdfText.setApiKey(apiKey);
        pdfText.setStartPage(1);
        pdfText.setPageCount(2);
        PdfTextResponse response = pdfText.process();
        
        if(response.getIsSuccessful()) {
            System.out.println(response.getJsonContent());
        } else {
            System.out.println(response.getErrorJson());
        }
    }
}

Available as a Composer package:

composer require dynamicpdf/api

Use composer to ensure you have the required PHP libraries.
Create a new class named ExtractText.
Add a Run method.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance and load the PdfResource instance in the constructor.
Set the PdfText instance's StartPage to one and PageCount to two.
Add a call to the PdfText instance's Process method to call the pdf-text endpoint.
If successful, print the text as JSON to the console.
Add a call to the ExtractText class's Run method.

<?php

require __DIR__ . '/vendor/autoload.php';

use DynamicPDF\Api\PdfResource;
use DynamicPDF\Api\PdfText;

class ExtractText
{
    private static string $BasePath = "C:/temp/dynamicpdf-api-samples/extract-text/";

    public static function Run()
    {
        $resource = new PdfResource(ExtractText::$BasePath . "fw4.pdf");
        $pdfText = new PdfText($resource);
        $pdfText->ApiKey = "DP.xxx-api-key-xxx";
        
        $pdfText->StartPage = 1;
        $pdfText->PageCount = 2;
        
        $response = $pdfText->Process();

        if($response->IsSuccessful)
        {
            echo ($response->JsonContent);
        } else {
            echo("Error: ");
            echo($response->StatusCode);
            echo($response->ErrorMessage);
        }
    }
}
ExtractText::Run();

Run the application with `php ExtractText.php`, and the JSON is output to the console.

Available as a Go package: https://pkg.go.dev/github.com/dynamicpdf-api/go-client

Ensure you have the required Go libraries.
Create a new file named pdf-text-example.go.
Add a main function.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance, passing it the PdfResource instance, a start page of one, and a page count of two.
Set the PdfText instance's endpoint base URL and API key.
Add a call to the PdfText instance's Process method to call the pdf-text endpoint.
If successful, print the text as JSON to the console.
Run the application with `go run pdf-text-example.go`, and the JSON is output to the console.

package main

import (
    "fmt"
    "github.com/dynamicpdf-api/go-client/endpoint"
    "github.com/dynamicpdf-api/go-client/resource"
)

func main() {

    resource := resource.NewPdfResourceWithResourcePath("C:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf", "fw4.pdf")
    txt := endpoint.NewPdfText(resource, 1, 2)
    txt.Endpoint.BaseUrl = "https://api.dpdf.io/"
    txt.Endpoint.ApiKey = "DP.xxx-api-key-xxx"

    resp := txt.Process()
    res := <-resp
    
    if res.IsSuccessful() {
        fmt.Print(string(res.Content().Bytes()))
    }
}

Available on PyPI:

pip install dynamicpdf-api

Ensure you have the required Python libraries.
Create a new file named PdfTextExample.py.
Add a run function.
Create a new PdfResource instance and load the path to the PDF.
Create a new PdfText instance and load the PdfResource instance in the constructor.
Set the PdfText instance's start_page to one and page_count to two.
Add a call to the PdfText instance's process method to call the pdf-text endpoint.
Print the text as JSON to the console.
Run the application with `python PdfTextExample.py`, and the JSON is output to the console.

from dynamicpdf_api.pdf_text import PdfText
from dynamicpdf_api.pdf_resource import PdfResource

def run(api_key):
    resource = PdfResource("C:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf")
    pdf_text = PdfText(resource)
    pdf_text.api_key = api_key
    pdf_text.start_page = 1
    pdf_text.page_count = 2
    response = pdf_text.process()
    print(response.json_content)

if __name__ == "__main__":
    api_key = "DP.xxx-api-key-xxx"
    run(api_key)

In all six languages, the steps were similar. First, we created a new PdfResource instance by loading the path to the PDF. Next, we created a new instance of the PdfText class, which abstracts the pdf-text endpoint, and set its start page and page count. Finally, we called the PdfText instance's process method and printed the resulting JSON to the console.
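
Once the JSON is in hand, a common next step is to inspect it per page. The Python sketch below is one possible way to do that, assuming the JSON has the shape shown in the cURL output (a list of objects with `pageNumber` and `text`). The function name `summarize_pages` is hypothetical, the sample string is an abbreviated stand-in rather than real endpoint output, and the truncation check simply looks for the "[Text Truncated" marker that appears in the evaluation output.

```python
import json

# Marker that appears in the evaluation output shown in this tutorial.
TRUNCATION_MARKER = "[Text Truncated"


def summarize_pages(json_content):
    """Parse pdf-text JSON and report length and truncation for each page."""
    pages = json.loads(json_content)
    return [
        {
            "pageNumber": page["pageNumber"],
            "characters": len(page["text"]),
            "truncated": TRUNCATION_MARKER in page["text"],
        }
        for page in pages
    ]


# Abbreviated, hypothetical stand-in for the JSON string a client returns.
sample = json.dumps([
    {"pageNumber": 1, "text": "Form W-4 ....[Text Truncated - ...]"},
    {"pageNumber": 2, "text": "General Instructions"},
])

for summary in summarize_pages(sample):
    print(summary)
```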
