HTMLTextExtractor
class HTMLTextExtractor extends FileTextExtractor
Text extractor that uses the PHP function strip_tags() to get just the text. Fine for indexing, but not ideal for producing readable text.
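To show why raw strip_tags() output is hard to read, here is a minimal sketch in plain PHP. The sample markup is invented for illustration and is not taken from the module.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>';

// strip_tags() removes the tags but inserts no whitespace in their place,
// so the text of adjacent block elements runs together.
$text = strip_tags($html);

echo $text, PHP_EOL; // prints "TitleFirst paragraph.Second paragraph."
```

All of the text survives, but the boundaries between block elements are lost, which is what makes the result awkward to display as-is.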
Traits
Provides extensions to this object to integrate it with standard config API methods.
A class that can be instantiated or replaced via dependency injection (DI)
Config options
priority | int | Lower priority because it is not the most sophisticated HTML extraction. If a better extractor is available, use it instead. |
Properties
protected static | array | $sorted_extractor_classes | Cache of extractor class names, sorted by priority | from FileTextExtractor |
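The idea of a cached, priority-sorted list of class names could be sketched roughly as follows. This is not the module's code: the ExtractorRegistry class, the extractor names, the priority values, and the highest-first ordering are all assumptions made for illustration.

```php
<?php
// Hypothetical registry: class names and priority values are invented
// for illustration and are not the module's real configuration.
final class ExtractorRegistry
{
    /** @var array|null Cached class names, sorted by priority */
    private static $sorted = null;

    /** @var array Map of class name to priority (assumed values) */
    private static $priorities = [
        'SimpleHtmlExtractor' => 10,
        'PdfExtractor' => 50,
        'OfficeExtractor' => 75,
    ];

    /**
     * Return class names ordered from highest to lowest priority.
     * The sort runs once; later calls reuse the cached result.
     */
    public static function getSortedClasses(): array
    {
        if (self::$sorted === null) {
            $priorities = self::$priorities;
            arsort($priorities); // sort by value, descending, keeping keys
            self::$sorted = array_keys($priorities);
        }
        return self::$sorted;
    }
}

print_r(ExtractorRegistry::getSortedClasses());
```

Caching in a static property means the sort cost is paid once per request, at the price of the list going stale if priorities change afterwards.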
Methods
Gets a configuration accessor for this class. Shorthand for Config::inst()->get($this->class, .....).
Gets the uninherited value for the given config option
An implementation of the factory method pattern that lets you create an instance of a class
Creates a class instance by the "singleton" design pattern.
Gets the list of prioritised extractor classes
Get the text file extractor for the given class
Given a File object, decide which extractor instance to use to handle it
Some text extractors (like pdftotext) may require a physical file to read from, so write the current file contents to a temp file and return its path
Extracts content by using strip_tags() combined with regular expressions to remove non-content tags like
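A rough sketch of combining strip_tags() with regular expressions might look like the following. The function name extractText, the choice of script and style as the non-content tags, and the list of block tags are assumptions for illustration. The module's actual patterns are not shown in this page.

```php
<?php
/**
 * Sketch of regex-assisted tag stripping. The function name and the
 * choice of tags are assumptions, not the module's actual code.
 */
function extractText(string $html): string
{
    // Drop <script> and <style> elements together with their contents,
    // which strip_tags() alone would leave behind as plain text.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);

    // Add a space after common closing block tags so words do not merge.
    $html = preg_replace('/<\/(p|div|h[1-6]|li)>/i', '$0 ', $html);

    // Remove the remaining tags, then collapse runs of whitespace.
    $text = strip_tags($html);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical sample markup, for illustration only.
$html = '<style>p { color: red; }</style><h1>Title</h1>'
    . '<p>Body text.</p><script>alert(1);</script>';

echo extractText($html), PHP_EOL; // prints "Title Body text."
```

Removing the non-content elements before calling strip_tags() matters because strip_tags() only deletes the tags themselves, so inline CSS or JavaScript would otherwise end up in the extracted text.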