class PDFTextExtractor extends FileTextExtractor

Text extractor that calls pdftotext to do the conversion.

Traits

Configurable

Provides extensions to this object to integrate it with standard config API methods.

Injectable

A class that can be instantiated or replaced via dependency injection (DI).

Config options

priority int

Set priority from 0-100.

from  FileTextExtractor
binary_location string

The bin path from which this extractor executes its binary.

search_binary_locations array

Locations to search for the binary. Used only if binary_location isn't set.

Properties

protected static array $sorted_extractor_classes

Cache of extractor class names, sorted by priority

from  FileTextExtractor

Methods

public static 
config()

Get a configuration accessor for this class. Shorthand for Config::inst()->get($this->class, .....).

public
mixed
uninherited(string $name)

Gets the uninherited value for the given config option

public static 
create(mixed ...$args)

An implementation of the factory method, allows you to create an instance of a class

public static 
singleton(string $class = null)

Creates a class instance by the "singleton" design pattern.

protected static 
array
get_extractor_classes()

Gets the list of prioritised extractor classes

protected static 
get_extractor(string $class)

Get the text file extractor for the given class

public static 
for_file(File|string $file)

Given a File object, decide which extractor instance to use to handle it

protected static 
string
getPathFromFile(File $file)

Some text extractors (like pdftotext) may require a physical file to read from, so this method writes the current file contents to a temp file and returns its path.

public
bool
isAvailable()

Checks if the extractor is supported on the current environment, for example if the correct binaries or libraries are available.

public
bool
supportsExtension(string $extension)

Determine if this extractor supports the given extension.

public
bool
supportsMime(string $mime)

Determine if this extractor supports the given MIME type.

public
string
getContent(File|string $file)

Given a File instance or a file path, extract the contents as text.

protected
string
bin(string $program = '')

Accessor to get the location of the binary

protected
string
getRawOutput(File|string $file)

Invoke pdftotext with the given File object or file path.

protected
string
cleanupLigatures(string $input)

Removes UTF-8 ligatures.

Details

static Config_ForClass config()

Get a configuration accessor for this class. Shorthand for Config::inst()->get($this->class, .....).

Return Value

Config_ForClass

mixed uninherited(string $name)

Gets the uninherited value for the given config option

Parameters

string $name

Return Value

mixed

static Injectable create(mixed ...$args)

An implementation of the factory method, allows you to create an instance of a class

This method will defer class substitution to the Injector API, which can be customised via the Config API to declare substitution classes.

This can be called in one of two ways - either calling via the class directly, or calling on Object and passing the class name as the first parameter. The following are equivalent:

    $list = DataList::create(SiteTree::class);
    $list = SiteTree::get();

Parameters

mixed ...$args

Return Value

Injectable

static Injectable singleton(string $class = null)

Creates a class instance by the "singleton" design pattern.

It will always return the same instance for this class, which can be used for performance reasons and as a simple way to access instance methods which don't rely on instance data (e.g. the custom SilverStripe static handling).

Parameters

string $class

Optional classname to create, if the called class should not be used

Return Value

Injectable

The singleton instance

static protected array get_extractor_classes()

Gets the list of prioritised extractor classes

Return Value

array
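The ordering step can be pictured with a small standalone sketch. It is not the module's implementation: the function name sortExtractorsByPriority, the example class names other than PDFTextExtractor, the priority values, and the assumption that a higher priority sorts first are all illustrative.

```php
<?php
/**
 * Hypothetical helper: order extractor class names by their priority.
 *
 * @param array<string,int> $priorities Map of class name => priority (0-100)
 * @return string[] Class names, highest priority first (an assumption)
 */
function sortExtractorsByPriority(array $priorities): array
{
    // arsort() sorts by value in descending order and preserves the keys.
    arsort($priorities);
    return array_keys($priorities);
}

// Illustrative class names and priority values only.
$sorted = sortExtractorsByPriority([
    'ExampleFallbackExtractor' => 10,
    'PDFTextExtractor' => 90,
    'ExampleFastExtractor' => 50,
]);
print_r($sorted);
```

Caching the sorted list, as the $sorted_extractor_classes property suggests, would avoid repeating the sort on every lookup.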

static protected FileTextExtractor get_extractor(string $class)

Get the text file extractor for the given class

Parameters

string $class

Return Value

FileTextExtractor

static FileTextExtractor|null for_file(File|string $file)

Given a File object, decide which extractor instance to use to handle it

Parameters

File|string $file

Return Value

FileTextExtractor|null

static protected string getPathFromFile(File $file)

Some text extractors (like pdftotext) may require a physical file to read from, so this method writes the current file contents to a temp file and returns its path.

Parameters

File $file

Return Value

string

Exceptions

Exception
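A minimal sketch of the temp-file technique described above, written against plain PHP rather than the File API. The function name writeToTempFile and the 'extract_' prefix are hypothetical, and the real method may handle naming, extensions, and cleanup differently.

```php
<?php
/**
 * Hypothetical helper: copy in-memory contents to a temporary file.
 *
 * @param string $contents Raw file contents
 * @param string $prefix   Prefix for the generated file name (illustrative)
 * @return string Path of the temporary file
 * @throws RuntimeException When the file cannot be created or written
 */
function writeToTempFile(string $contents, string $prefix = 'extract_'): string
{
    // tempnam() creates a uniquely named empty file and returns its path.
    $path = tempnam(sys_get_temp_dir(), $prefix);
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    if (file_put_contents($path, $contents) === false) {
        throw new RuntimeException("Could not write to {$path}");
    }
    return $path;
}

$path = writeToTempFile('%PDF-1.4 example bytes');
echo $path, PHP_EOL;
unlink($path); // In this sketch the caller is responsible for cleanup.
```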

bool isAvailable()

Checks if the extractor is supported on the current environment, for example if the correct binaries or libraries are available.

Return Value

bool

bool supportsExtension(string $extension)

Determine if this extractor supports the given extension.

If support is determined by MIME type only, then this should return false.

Parameters

string $extension

Return Value

bool

bool supportsMime(string $mime)

Determine if this extractor supports the given MIME type.

Will only be called if supportsExtension returns false.

Parameters

string $mime

Return Value

bool
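The relationship between isAvailable, supportsExtension, and supportsMime can be sketched as a selection loop. This is an illustration under assumptions, not the module's code: the ExampleExtractor interface and pickExtractor function are hypothetical stand-ins, and the list passed in is assumed to be sorted by priority already.

```php
<?php
/** Minimal stand-in for an extractor; not the module's real class. */
interface ExampleExtractor
{
    public function isAvailable(): bool;
    public function supportsExtension(string $extension): bool;
    public function supportsMime(string $mime): bool;
}

/**
 * Hypothetical selection routine: return the first usable extractor.
 *
 * @param ExampleExtractor[] $extractors Candidates, assumed sorted by priority
 */
function pickExtractor(array $extractors, string $extension, string $mime): ?ExampleExtractor
{
    foreach ($extractors as $extractor) {
        // Skip extractors whose binaries or libraries are missing.
        if (!$extractor->isAvailable()) {
            continue;
        }
        // The extension check comes first.
        if ($extractor->supportsExtension($extension)) {
            return $extractor;
        }
        // The MIME type is consulted only when the extension check fails.
        if ($extractor->supportsMime($mime)) {
            return $extractor;
        }
    }
    return null;
}

// Usage with a throwaway extractor that only recognises the PDF MIME type.
$mimeOnly = new class implements ExampleExtractor {
    public function isAvailable(): bool { return true; }
    public function supportsExtension(string $extension): bool { return false; }
    public function supportsMime(string $mime): bool { return $mime === 'application/pdf'; }
};
var_dump(pickExtractor([$mimeOnly], 'bin', 'application/pdf') === $mimeOnly);
```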

string getContent(File|string $file)

Given a File instance or a file path, extract the contents as text.

Parameters

File|string $file

Either the File instance, or a file path for a file to load

Return Value

string

protected string bin(string $program = '')

Accessor to get the location of the binary

Parameters

string $program

Name of binary

Return Value

string
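How the binary_location and search_binary_locations options might interact can be sketched as follows. The function findBinary is hypothetical, the directories in the example call are illustrative rather than documented defaults, and the sketch assumes that binary_location names a directory rather than a full path to the executable.

```php
<?php
/**
 * Hypothetical helper: resolve a program name to a full path.
 *
 * @param string      $program         Binary name, e.g. 'pdftotext'
 * @param string|null $binaryLocation  Stand-in for the binary_location option
 * @param string[]    $searchLocations Stand-in for search_binary_locations
 * @return string|null Full path, or null when nothing matches
 */
function findBinary(string $program, ?string $binaryLocation, array $searchLocations): ?string
{
    // An explicit location takes precedence; the search list is only a fallback.
    $locations = $binaryLocation ? [$binaryLocation] : $searchLocations;

    foreach ($locations as $location) {
        $path = rtrim($location, '/') . '/' . $program;
        if (is_file($path)) {
            return $path;
        }
    }
    return null;
}

// Example call; the directories are illustrative only.
$path = findBinary('pdftotext', null, ['/usr/bin', '/usr/local/bin']);
echo $path ?? 'pdftotext not found', PHP_EOL;
```

A lookup like this also gives isAvailable a natural basis: an extractor whose binary cannot be resolved can report itself as unavailable.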

protected string getRawOutput(File|string $file)

Invoke pdftotext with the given File object or file path.

Parameters

File|string $file

Return Value

string Output

Exceptions

Exception
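One way to invoke pdftotext from PHP is sketched below. The helper names buildPdftotextCommand and runPdftotext are hypothetical, and the module may pass different options or handle errors differently. The only pdftotext behaviour relied on is that "-" as the output file sends the text to standard output.

```php
<?php
/**
 * Hypothetical helper: build a shell command that prints a PDF's text to stdout.
 */
function buildPdftotextCommand(string $binary, string $pdfPath): string
{
    // escapeshellarg() quotes each argument so spaces and shell metacharacters are safe.
    // "-" as the output file tells pdftotext to write to standard output.
    return sprintf('%s %s - 2>&1', escapeshellarg($binary), escapeshellarg($pdfPath));
}

/**
 * Hypothetical helper: run the command and return its output.
 *
 * @throws RuntimeException When pdftotext exits with a non-zero code
 */
function runPdftotext(string $binary, string $pdfPath): string
{
    exec(buildPdftotextCommand($binary, $pdfPath), $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException(
            "pdftotext exited with code {$exitCode}: " . implode("\n", $lines)
        );
    }
    return implode("\n", $lines);
}

// Print the command only; running it requires pdftotext to be installed.
echo buildPdftotextCommand('/usr/bin/pdftotext', '/tmp/my report.pdf'), PHP_EOL;
```

Throwing on a non-zero exit code matches the Exception listed for this method, though the exact exception type used here is an assumption.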

protected string cleanupLigatures(string $input)

Removes UTF-8 ligatures.

Parameters

string $input

Return Value

string
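A common way to remove ligatures is to replace each ligature code point with its plain-letter equivalent, so that a search for "office" still matches text typeset with a ligature. The sketch below shows that approach. The function name replaceLigatures and the exact set of mapped characters are assumptions, not the module's mapping.

```php
<?php
/**
 * Hypothetical helper: replace common Latin ligatures with plain letters.
 */
function replaceLigatures(string $input): string
{
    $mapping = [
        "\u{FB00}" => 'ff',
        "\u{FB01}" => 'fi',
        "\u{FB02}" => 'fl',
        "\u{FB03}" => 'ffi',
        "\u{FB04}" => 'ffl',
        "\u{FB06}" => 'st',
    ];
    // strtr() with an array replaces each key with its value in a single pass.
    return strtr($input, $mapping);
}

echo replaceLigatures("e\u{FB03}cient work\u{FB02}ow"), PHP_EOL; // prints "efficient workflow"
```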