Extract PDF's Text
Extract text from PDF documents using the
pdf-text
endpoint.The pdf-text
endpoint is for extracting text from PDF documents. In this tutorial we demonstrate just how easy it is to extract text from a PDF document using the pdf-text
endpoint.
First we use cURL to call the endpoint directly as a REST call. We then use the DynamicPDF API client libraries to call the endpoint programmatically.
Required Resources
To complete this tutorial, you must add the Extract Text (pdf-text endpoint) sample to your samples
folder in your cloud storage space using the File Manager. After adding the sample resources, you should see a samples/extract-text-pdf-text-endpoint
folder containing the resources for this tutorial.
Sample | Sample Folder | Resources |
---|---|---|
Extract Text | samples/extract-text-pdf-text-endpoint | fw4.pdf |
- From the File Manager, download
fw4.pdf
to your local system; here we assume/temp/dynamicpdf-api-samples/extract-text
. - After downloading, delete
fw4.pdf
from your cloud storage space using the File Manager.
Resource | Cloud/Local |
---|---|
fw4.pdf | local |
See Sample Resources for instructions on adding sample resources.
Obtaining API Key
This tutorial assumes a valid API key obtained from the DynamicPDF API's Portal
. Refer to the following for instructions on getting an API key.
If you are not familiar with the File Manager or Apps and API Keys, refer to the following tutorial and relevant Users Guide pages.
Calling API Directly Using POST
The pdf-text
endpoint takes a POST request. When using cURL, you specify the endpoint, the HTTP command, the API key and the local resources required. However, the pdf-text
endpoint also allows specifying the starting page and page count as query parameters.
Let's extract the text of only the first two pages of the PDF. Because we only wish to extract the text from the first two pages, in addition to sending the PDF and API key in the request, we must also send two query string parameters, startPage
and pageCount
.
Figure 1. To extract the first two pages of this PDF select start page and the number of pages.
Parameter | Parameter Type | Value |
---|---|---|
startPage | Query | 1 |
pageCount | Query | 2 |
Setting the startPage
and pageCount
both to zero (or omitting the querystring parameters) defaults to getting all pages of the PDF.
Make Request Using API
- Create the following cURL command where the PDF is sent to the endpoint as binary data and then execute the command.
- Add the
startPage
andpageCount
as querystring parameters to the request URL. - Specify the
Content-Type
asapplication/pdf
so the request knows to get the binary data as a PDF.
curl -X POST "https://api.dpdf.io/v1.0/pdf-text?startPage=1&pageCount=2"
-H "Content-Type: application/pdf"
-H "Authorization: Bearer DP.xxx-api-key-xxx"
--data-binary "@c:/dynamic-pdf-api-samples/extract-text/fw4.pdf"
- Execute the command and the following text is returned to the commandline as a JSON document.
[
{
"pageNumber": 1,
"text": "[DynamicPDF Evaluation] Form W-4\n(Rev. December 2020)\nDepartment of the Treasury \nInternal Revenue Service \nEmployee’s Withholding Certificate\n ▶ Complete Form W-4 so that your employer can withhold the correct federal income tax from your pay. \n ▶ Give Form W-4 to your empl ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
},
{
"pageNumber": 2,
"text": "[DynamicPDF Evaluation] Form W-4 (2021) Page 2\nGeneral Instructions\nFuture Developments\nFor the latest information about developments related to \nForm W-4, such as legislation enacted after it was published, \ngo to www.irs.gov/FormW4 .\nPurpose of Form\nComplete Form W-4 so that yo ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
}
]
Calling Endpoint Using Client Library
To simplify development, you can also use one of the DynamicPDF API client libraries. Use the client library of your choice to complete this tutorial section.
Complete Source
You can access the complete source for this project at one of the following GitHub projects.
Language | File Name | Location (package/namespace/etc.) | GitHub Project |
---|---|---|---|
Java | ExtractText.java | com.dynamicpdf.api.examples | https://github.com/dynamicpdf-api/java-client-examples |
C# | Program.cs | ExtractText | https://github.com/dynamicpdf-api/dotnet-client-examples |
Nodejs | ExtractText.js | nodejs-client-examples | https://github.com/dynamicpdf-api/nodejs-client-examples |
PHP | ExtractText.php | php-client-examples | https://github.com/dynamicpdf-api/php-client-examples |
GO | pdf-text-example.go | go-client-examples | https://github.com/dynamicpdf-api/go-client-examples/tree/main |
Python | PdfTextExample.py | python-client-examples | https://github.com/dynamicpdf-api/python-client-examples |
Click on the language tab of choice to view the tutorial steps for the particular language.
- C# (.NET)
- Java
- Node.js
- PHP
- GO
- Python
Available on NuGet:
Install-Package DynamicPDF.API
- Create a new Console App (.NET Core) project named
ExtractText
. - Add the DynamicPDF.API NuGet package.
- Create a new static method named
Run
. - Add the following code to the
Run
method. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sStartPage
to one andPageCount
to two. - Add a call to the
PdfText
instancesProcess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
using DynamicPDF.Api;
using System;
namespace ExtractText
{
class Program
{
static void Main(string[] args)
{
Run("DP.xxx-api-key-xxx", "C:/temp/dynamicpdf-api-samples/extract-text/");
}
public static void Run(String apiKey, String basePath)
{
PdfResource resource = new PdfResource(basePath + "fw4.pdf");
PdfText pdfText = new PdfText(resource);
pdfText.ApiKey = apiKey;
pdfText.StartPage = 1;
pdfText.PageCount = 2;
PdfTextResponse response = pdfText.Process();
if (response.IsSuccessful)
{
Console.WriteLine((response.JsonContent));
} else
{
Console.WriteLine(response.ErrorJson);
}
}
}
}
Available on NPM:
npm i @dynamicpdf/api
- Use npm to install the DynamicPDF API module.
- Create a new class named
ExtractText
. - Create a static
Run
method. - Add the following code to the
Run
method. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sstartPage
to one andpageCount
to two. - Add a call to the
PdfText
instancesprocess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
- Add a call the the
ExtractText.Run
method.
import {
PdfResource,
PdfText
} from "@dynamicpdf/api"
export class ExtractText {
static async Run() {
var resource = new PdfResource("C:/temp/dynamicpdf-api-samples/extract-text/fw4.pdf");
var pdfText = new PdfText(resource);
pdfText.apiKey = "DP.xxx-api-key-xxx";
pdfText.startPage = 1;
pdfText.pageCount = 2;
var res = await pdfText.process();
if (res.isSuccessful) {
console.log(JSON.parse(res.content));
} else {
console.log(res.errorJson);
}
}
}
await ExtractText.Run();
- Run the application
node ExtractText.js
and the JSON is output to the console.
Available on Maven:
https://search.maven.org/search?q=g:com.dynamicpdf.api
<dependency>
<groupId>com.dynamicpdf.api</groupId>
<artifactId>dynamicpdf-api</artifactId>
<version>1.0.0</version>
</dependency>
- Create a new Maven project and add the DynamicPDF API as a dependency.
- Create a new class named
ExtractText
with amain
method. - Create a new method named
Run
. - Add the
Run
method call tomain
. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sStartPage
to one andPageCount
to two. - Add a call to the
PdfText
instancesprocess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
- Run the application and the text output, written as JSON, is written to the console.
package com.dynamicpdf.api.examples;
import com.dynamicpdf.api.PdfResource;
import com.dynamicpdf.api.PdfText;
import com.dynamicpdf.api.PdfTextResponse;
public class ExtractingText {
public static void main(String[] args) {
ExtractingText.Run("DP.xxx-api-key-xxx",
"C:/temp/dynamicpdf-api-samples/extract-text/");
}
public static void Run(String apiKey, String basePath) {
PdfResource resource = new PdfResource(basePath + "fw4.pdf");
PdfText pdfText = new PdfText(resource);
pdfText.setApiKey(apiKey);
pdfText.setStartPage(1);
pdfText.setPageCount(2);
PdfTextResponse response = pdfText.process();
if(response.getIsSuccessful()) {
System.out.println(response.getJsonContent());
} else {
System.out.println(response.getErrorJson());
}
}
}
Available as a Composer package:
composer require dynamicpdf/api
- Use composer to ensure you have the required PHP libraries.
- Create a new class named
ExtractText
. - Add a
Run
method. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sStartPage
to one andPageCount
to two. - Add a call to the
PdfText
instancesProcess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
- Add a call to run the
ExtractText
classesRun
method.
<?php
require __DIR__ . '/vendor/autoload.php';
use DynamicPDF\Api\PdfResource;
use DynamicPDF\Api\PdfText;
class ExtractText
{
private static string $BasePath = "C:/temp/dynamicpdf-api-samples/extract-text/";
public static function Run()
{
$resource = new PdfResource(ExtractText::$BasePath . "fw4.pdf");
$pdfText = new PdfText($resource);
$pdfText->ApiKey ="DP.xxx-api-key-xxx";
$pdfText->StartPage = 1;
$pdfText->PageCount = 2;
$response = $pdfText->Process();
if($response->IsSuccessful)
{
echo ($response->JsonContent);
} else {
echo("Error: ");
echo($response->StatusCode);
echo($response->ErrorMessage);
}
}
}
ExtractText::Run();
- Run the application
php ExtractText.php
and the JSON is output to the console.
Available as a GO package: https://pkg.go.dev/github.com/dynamicpdf-api/go-client
- Ensure you have the required GO libraries.
- Create a new file named
pdf-text-example.go
. - Add a
main
method. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sStartPage
to one andPageCount
to two. - Add a call to the
PdfText
instancesProcess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
- Run the application
go run pdf-text-example.go
and the JSON is output to the console.
package main
import (
"fmt"
"github.com/dynamicpdf-api/go-client/endpoint"
"github.com/dynamicpdf-api/go-client/resource"
)
func main() {
resource := resource.NewPdfResourceWithResourcePath("C:/temp/dynamicpdf-api-samples/fw4.pdf", "fw4.pdf")
txt := endpoint.NewPdfText(resource,1,3)
txt.Endpoint.BaseUrl = "https://api.dpdf.io/"
txt.Endpoint.ApiKey = "DP.xxx-api-key-xxx"
resp := txt.Process()
res := <-resp
if res.IsSuccessful() == true {
fmt.Print(string(res.Content().Bytes()))
}
}
Available at: pip install dynamicpdf-api
- Ensure you have the required Python libraries.
- Create a new file named
PdfTextExample.py
. - Add a
run
method. - Create a new
PdfResource
instance and load the path to the PDF. - Create a new
PdfText
instance and load thePdfResource
instance in the constructor. - Set the
PdfText
instance'sStartPage
to one andPageCount
to two. - Add a call to the
PdfText
instancesProcess
method to call thepdf-text
endpoint. - If successful, print the text as JSON to the console.
- Run the application
python PdfTextExample.py
and the JSON is output to the console.
from dynamicpdf_api.pdf_text import PdfText
from dynamicpdf_api.pdf_resource import PdfResource
def run(api_key):
resource = PdfResource("C:/temp/dynamicpdf-api-samples/pdf-info/fw4.pdf")
pdf_text = PdfText(resource)
pdf_text.api_key = api_key
pdf_text.start_page=1
pdf_text.page_count=2
response = pdf_text.process()
print(response.json_content)
if __name__ == "__main__":
api_key = "DP.xxx-api-key-xxx"
run(api_key)
In all six languages, the steps were similar. First, we created a new PdfResource
instance by loading the path to the PDF via the constructor. Next, we created a new instance of the PdfText
class, which abstracts the pdf-text
endpoint. Then the PdfText
instance prints the extracted text as JSON after processing. Finally, we called the Process
method and printed the resultant JSON to the console.