class HTMLTextExtractor extends FileTextExtractor

Text extractor that uses the PHP function strip_tags() to get just the text. Adequate for indexing, but not ideal for producing readable text.
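The general technique can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name sketchExtractText() is hypothetical, and treating script and style as the non-content tags is an assumption.

```php
<?php
// Hypothetical helper for illustration only; not part of the module.
function sketchExtractText(string $html): string
{
    // Assumption: <script> and <style> are the non-content elements.
    // Remove them together with their contents, because strip_tags()
    // alone would keep the code inside them as plain text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space before every remaining tag so that adjacent blocks
    // do not fuse together once the tags are gone.
    $html = str_replace('<', ' <', $html);

    // Strip the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo sketchExtractText('<h1>Title</h1><script>var x = 1;</script><p>Fish &amp; chips</p>'), "\n";
// prints "Title Fish & chips"
```

The result is a flat string with no headings or paragraph breaks, which is why this approach suits search indexing better than display.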

Traits

Provides extensions to this object to integrate it with standard config API methods.

A class that can be instantiated or replaced via DI

Config options

priority int

Lower priority because it's not the most clever HTML extraction. If something better is available, use it.

Properties

protected static array $sorted_extractor_classes

Cache of extractor class names, sorted by priority

from  FileTextExtractor

Methods

public static 
config()

Get a configuration accessor for this class. Shorthand for Config::inst()->get($this->class, .....).

public
mixed
stat(string $name) deprecated

Get inherited config value

public
mixed
uninherited(string $name)

Gets the uninherited value for the given config option

public
$this
set_stat(string $name, mixed $value) deprecated

Update the config value for a given property

public static 
create(mixed ...$args)

An implementation of the factory method that allows you to create an instance of a class

public static 
singleton(string $class = null)

Creates a class instance using the "singleton" design pattern.

protected static 
array
get_extractor_classes()

Gets the list of prioritised extractor classes
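A priority-sorted, cached class list of this kind might look roughly like the sketch below. The class SketchExtractorRegistry, the extractor names other than HTMLTextExtractor, and all priority values are made up for illustration. It also assumes that a larger number means the extractor is tried first, which this page does not state.

```php
<?php
// Hypothetical registry for illustration; not the module's code.
final class SketchExtractorRegistry
{
    /** @var array<string,int> class name => priority (invented values) */
    private static array $priorities = [
        'HTMLTextExtractor'        => 10,
        'HypotheticalTikaExtractor' => 90,
        'HypotheticalPdfExtractor'  => 50,
    ];

    /** @var string[]|null cache of class names, sorted by priority */
    private static ?array $sorted = null;

    /** @return string[] class names, most preferred first */
    public static function sortedClasses(): array
    {
        if (self::$sorted === null) {
            $priorities = self::$priorities;
            // Assumption: higher priority value wins, so sort descending
            // by value while keeping the class-name keys.
            arsort($priorities);
            self::$sorted = array_keys($priorities);
        }
        return self::$sorted;
    }
}

print_r(SketchExtractorRegistry::sortedClasses());
```

The sort runs only on the first call; later calls return the cached array.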

protected static 
get_extractor(string $class)

Get the text file extractor for the given class

public static 
for_file(File|string $file)

Given a File object, decide which extractor instance to use to handle it
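One plausible way to make this decision is to walk the prioritised extractors and return the first one that is available and supports the file's extension. The sketch below is self-contained and assumes exactly that. The interface SketchExtractor, the class SketchHtmlExtractor, the function sketchForFile() and the supported extension list are all hypothetical, and the sketch matches on extension only.

```php
<?php
// Hypothetical stand-ins for illustration; not the module's classes.
interface SketchExtractor
{
    public function isAvailable(): bool;
    public function supportsExtension(string $extension): bool;
}

final class SketchHtmlExtractor implements SketchExtractor
{
    public function isAvailable(): bool
    {
        // strip_tags() is part of PHP core, so this is always true.
        return function_exists('strip_tags');
    }

    public function supportsExtension(string $extension): bool
    {
        // Assumed extension list.
        return in_array(strtolower($extension), ['html', 'htm'], true);
    }
}

/**
 * @param SketchExtractor[] $extractors already sorted, most preferred first
 */
function sketchForFile(string $filename, array $extractors): ?SketchExtractor
{
    $extension = pathinfo($filename, PATHINFO_EXTENSION);
    foreach ($extractors as $extractor) {
        if ($extractor->isAvailable() && $extractor->supportsExtension($extension)) {
            return $extractor;
        }
    }
    return null; // nothing can handle this file
}

$extractor = sketchForFile('about-us.HTML', [new SketchHtmlExtractor()]);
echo $extractor === null ? 'no extractor' : get_class($extractor), "\n";
```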

protected static 
string
getPathFromFile(File $file)

Some text extractors (like pdftotext) may require a physical file to read from, so write the current file contents to a temp file and return its path
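The temp-file step could be sketched as below, using only PHP core functions. The function name sketchWriteToTempFile() and the 'extract_' file prefix are assumptions for illustration. The real method takes a File object, whereas this sketch takes the contents as a string.

```php
<?php
// Hypothetical helper for illustration; not the module's implementation.
function sketchWriteToTempFile(string $contents): string
{
    // tempnam() creates an empty, uniquely named file and returns its path.
    $path = tempnam(sys_get_temp_dir(), 'extract_');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    // Write the current file contents so that external tools can read them.
    if (file_put_contents($path, $contents) === false) {
        throw new RuntimeException('Could not write to ' . $path);
    }

    return $path;
}

$path = sketchWriteToTempFile('<p>Hello</p>');
echo file_get_contents($path), "\n"; // prints "<p>Hello</p>"
unlink($path);                       // the caller cleans up when done
```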

public
bool
isAvailable()

No description

public
bool
supportsExtension(string $extension)

No description

public
bool
supportsMime(string $mime)

No description

public
string
getContent(File|string $file)

Extracts content from regex, by using strip_tags() combined with regular expressions to remove non-content tags like