
Extract PDF's Text


Extract text from PDF documents using the `pdf-text` endpoint.

This tutorial demonstrates how to call the pdf-text endpoint to extract the text of the first two pages of a PDF document.

First we use cURL to call the endpoint directly as a REST call. We then use the DynamicPDF API client libraries to call the endpoint programmatically.

Required Resources

To complete this tutorial, you must add the Extract Text (pdf-text endpoint) sample to your samples folder in your cloud storage space using the File Manager. After adding the sample resources, you should see a samples/extract-text-pdf-text-endpoint folder containing the resources for this tutorial.

| Sample | Sample Folder | Resources |
| --- | --- | --- |
| Extract Text | samples/extract-text-pdf-text-endpoint | fw4.pdf |
  • From the File Manager, download fw4.pdf to your local system; this tutorial assumes C:/temp/dynamicpdf-api-samples/extract-text.
  • After downloading, delete fw4.pdf from your cloud storage space using the File Manager.
| Resource | Cloud/Local |
| --- | --- |
| fw4.pdf | local |
Tip: See Sample Resources for instructions on adding sample resources.

Obtaining API Key

This tutorial assumes you have a valid API key obtained from the DynamicPDF API Portal.

Tip: If you are not familiar with the File Manager or Apps and API Keys, refer to the relevant Users Guide pages.

Calling API Directly Using POST

The pdf-text endpoint takes a POST request. When using cURL, you specify the endpoint, the HTTP method, the API key, and the local resources required. The endpoint also accepts the starting page and page count as optional query parameters.

Let's extract the text of only the first two pages of the PDF. To do so, in addition to sending the PDF and API key in the request, we send two query string parameters: startPage and pageCount.

Figure 1. To extract the first two pages of this PDF, specify the start page and the number of pages.

| Parameter | Parameter Type | Value |
| --- | --- | --- |
| startPage | Query | 1 |
| pageCount | Query | 2 |
Info: Setting both startPage and pageCount to zero (or omitting the query string parameters) returns the text of all pages of the PDF.
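
As a minimal sketch of how the two query parameters shape the request URL, the following Python helper builds the URL with or without them. The helper name `build_pdf_text_url` is hypothetical and is not part of any DynamicPDF client library; the endpoint URL and parameter names come from this tutorial.

```python
from urllib.parse import urlencode

# Endpoint URL as used in the cURL command in this tutorial.
PDF_TEXT_URL = "https://api.dynamicpdf.com/v1.0/pdf-text"


def build_pdf_text_url(start_page=None, page_count=None):
    """Return the pdf-text URL, adding only the query parameters that were given.

    Hypothetical helper for illustration only.
    """
    params = {}
    if start_page is not None:
        params["startPage"] = start_page
    if page_count is not None:
        params["pageCount"] = page_count
    # With no parameters the bare URL is returned, which requests all pages.
    return f"{PDF_TEXT_URL}?{urlencode(params)}" if params else PDF_TEXT_URL


if __name__ == "__main__":
    print(build_pdf_text_url(1, 2))  # first two pages
    print(build_pdf_text_url())      # all pages
```

Leaving a parameter out, rather than sending zero, keeps the URL short when you want every page.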

Make Request Using API

  • Create the following cURL command, which sends the PDF to the endpoint as binary data.
  • Add startPage and pageCount as query string parameters to the request URL.
  • Specify the Content-Type as application/pdf so the endpoint interprets the binary data as a PDF.
curl -X POST "https://api.dynamicpdf.com/v1.0/pdf-text?startPage=1&pageCount=2" -H "Content-Type: application/pdf" -H "Authorization: Bearer DP.xxx-api-key-xxx" --data-binary "@c:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf"
  • Execute the command; the endpoint returns the following JSON document to the command line.
[
  {
    "pageNumber": 1,
    "text": "[DynamicPDF Evaluation] Form W-4\n(Rev. December 2020)\nDepartment of the Treasury \nInternal Revenue Service \nEmployee’s Withholding Certificate\n ▶ Complete Form W-4 so that your employer can withhold the correct federal income tax from your pay. \n ▶ Give Form W-4 to your empl ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
  },
  {
    "pageNumber": 2,
    "text": "[DynamicPDF Evaluation] Form W-4 (2021) Page 2\nGeneral Instructions\nFuture Developments\nFor the latest information about developments related to \nForm W-4, such as legislation enacted after it was published, \ngo to www.irs.gov/FormW4 .\nPurpose of Form\nComplete Form W-4 so that yo ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
  }
]
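
The same request can be sketched in Python using only the standard library. This is an illustration, not the DynamicPDF client library: the function names `build_request`, `pages_by_number`, and `extract_text` are hypothetical, while the URL, headers, and JSON shape are taken from the cURL command and response shown above.

```python
import json
import urllib.request

# URL and headers mirror the cURL command in this tutorial.
PDF_TEXT_URL = "https://api.dynamicpdf.com/v1.0/pdf-text?startPage=1&pageCount=2"


def build_request(api_key, pdf_bytes, url=PDF_TEXT_URL):
    """Build the POST request that the cURL command sends (nothing is sent here)."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,  # raw PDF bytes, like cURL's --data-binary
        method="POST",
        headers={
            "Content-Type": "application/pdf",
            "Authorization": f"Bearer {api_key}",
        },
    )


def pages_by_number(response_body):
    """Map each pageNumber to its text, given a JSON array shaped like the response above."""
    return {page["pageNumber"]: page["text"] for page in json.loads(response_body)}


def extract_text(api_key, pdf_path):
    """Send the PDF at pdf_path to the endpoint and return {pageNumber: text}."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_request(api_key, pdf_file.read())
    with urllib.request.urlopen(request) as response:
        return pages_by_number(response.read())
```

Calling `extract_text` with your own API key and the local path to fw4.pdf would perform the same call as the cURL command. Building the request is kept separate from sending it so the headers and body can be inspected without network access.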

Calling Endpoint Using Client Library

To simplify development, you can also use one of the DynamicPDF API client libraries. Use the client library of your choice to complete this tutorial section.

Complete Source

You can access the complete source for this project at one of the following GitHub projects.

| Language | File Name | Location (package/namespace/etc.) | GitHub Project |
| --- | --- | --- | --- |
| Java | ExtractText.java | com.dynamicpdf.api.examples | https://github.com/dynamicpdf-api/java-client-examples |
| C# | Program.cs | ExtractText | https://github.com/dynamicpdf-api/dotnet-client-examples |
| Node.js | ExtractText.js | nodejs-client-examples | https://github.com/dynamicpdf-api/nodejs-client-examples |
| PHP | ExtractText.php | php-client-examples | https://github.com/dynamicpdf-api/php-client-examples |
| Go | pdf-text-example.go | go-client-examples | https://github.com/dynamicpdf-api/go-client-examples/tree/main |
| Python | PdfTextExample.py | python-client-examples | https://github.com/dynamicpdf-api/python-client-examples |
Tip: The steps below use the C# (.NET) client library; the steps for the other client libraries are similar.

Available on NuGet:

Install-Package DynamicPDF.API
  • Create a new Console App (.NET Core) project named ExtractText.
  • Add the DynamicPDF.API NuGet package.
  • Create a new static method named Run.
  • Add the following code to the Run method.
  • Create a new PdfResource instance, passing the path to the PDF to its constructor.
  • Create a new PdfText instance, passing the PdfResource instance to its constructor.
  • Set the PdfText instance's StartPage to one and PageCount to two.
  • Call the PdfText instance's Process method to call the pdf-text endpoint.
  • If successful, print the text as JSON to the console.
using DynamicPDF.Api;
using System;

namespace ExtractText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Replace with your own API key and the folder containing fw4.pdf.
            Run("DP.xxx-api-key-xxx", "C:/temp/dynamicpdf-api-samples/extract-text/");
        }

        public static void Run(string apiKey, string basePath)
        {
            // Load the local PDF and wrap it in a PdfText request.
            PdfResource resource = new PdfResource(basePath + "fw4.pdf");
            PdfText pdfText = new PdfText(resource);
            pdfText.ApiKey = apiKey;

            // Extract only the first two pages.
            pdfText.StartPage = 1;
            pdfText.PageCount = 2;

            // Call the pdf-text endpoint and print the result or the error.
            PdfTextResponse response = pdfText.Process();
            if (response.IsSuccessful)
            {
                Console.WriteLine(response.JsonContent);
            }
            else
            {
                Console.WriteLine(response.ErrorJson);
            }
        }
    }
}

In all six languages, the steps were similar. First, we created a new PdfResource instance by passing the path to the PDF to the constructor. Next, we created a new instance of the PdfText class, which abstracts the pdf-text endpoint, and set its API key, start page, and page count. Finally, we called the Process method and printed the resultant JSON to the console.
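
The success-or-error branch at the end of these steps can also be sketched against the raw REST call. The Python example below is an assumption-laden illustration: `process` is a hypothetical function name, not a client library API, and it assumes only that a failed call surfaces as an HTTP error whose body can be read as text. The `send` parameter exists so the logic can be exercised without network access.

```python
import io
import json
import urllib.error
import urllib.request


def process(request, send=urllib.request.urlopen):
    """Send the request; return (True, pages) on success or (False, error_text) on an HTTP error.

    Hypothetical helper; `send` defaults to urlopen but can be swapped for a stand-in.
    """
    try:
        with send(request) as response:
            return True, json.loads(response.read())
    except urllib.error.HTTPError as error:
        # The error body is returned as text; its exact shape is not assumed here.
        return False, error.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Offline demonstration: a BytesIO object stands in for the HTTP response.
    demo_request = urllib.request.Request(
        "https://api.dynamicpdf.com/v1.0/pdf-text", data=b"", method="POST"
    )
    fake_body = b'[{"pageNumber": 1, "text": "Form W-4"}]'
    print(process(demo_request, send=lambda _request: io.BytesIO(fake_body)))
```

Returning a flag alongside the payload mirrors the IsSuccessful check in the C# example, so the caller decides how to report an error.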
