# org.bibliome.alvisnlp.modules.TextFileReader
Reads files and adds a document to the corpus for each file.
org.bibliome.alvisnlp.modules.TextFileReader reads one or more files from sourcePath and creates a document in the corpus for each file. The identifier of each created document is the absolute path of the corresponding file. Each created document has a single section, named contents by default, whose contents are the contents of the corresponding file.
If sourcePath is the path to a file, then org.bibliome.alvisnlp.modules.TextFileReader reads this file. If sourcePath is the path to a directory, then it reads the files in this directory. If recursive is set to true, then the files in sub-directories are also read, recursively. In all cases, org.bibliome.alvisnlp.modules.TextFileReader only reads files whose name matches acceptPattern. If acceptPattern is not set, then it reads all files.
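The selection rules above can be sketched in Python as follows. This is an illustration only, not the module's implementation: the function name `select_files` and its arguments are hypothetical, and the sketch assumes that acceptPattern is a regular expression that must match the whole base name of the file, which the text above does not specify.

```python
import re
from pathlib import Path


def select_files(source_path, recursive=False, accept_pattern=None):
    """Return the files a reader would visit, in sorted order.

    Assumption: accept_pattern is a regular expression matched against
    the entire base name of each file (not its full path).
    """
    root = Path(source_path)
    if root.is_file():
        # A path to a file selects that single file.
        candidates = [root]
    elif root.is_dir():
        # A path to a directory selects its files, and optionally the
        # files of all sub-directories.
        walker = root.rglob("*") if recursive else root.glob("*")
        candidates = sorted(p for p in walker if p.is_file())
    else:
        raise FileNotFoundError(str(source_path))

    if accept_pattern is None:
        # No pattern set: every file is accepted.
        return candidates

    regex = re.compile(accept_pattern)
    return [p for p in candidates if regex.fullmatch(p.name)]
```

For example, `select_files("corpus", recursive=True, accept_pattern=r".*\.txt")` would return every `.txt` file below a hypothetical `corpus` directory.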
If linesLimit is set, then org.bibliome.alvisnlp.modules.TextFileReader creates a new document for each block of at most linesLimit lines. For instance, if linesLimit is set to 10 and a file contains 25 lines, then 3 documents are created: two containing 10 lines each and one containing the last 5 lines.
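The arithmetic of the linesLimit example can be checked with a short Python sketch. The helper `chunk_lines` is hypothetical and only mirrors the behaviour described above; how the module handles an empty file or names the resulting documents is not covered here.

```python
def chunk_lines(lines, lines_limit=None):
    """Split a sequence of lines into blocks of at most lines_limit lines.

    Without a limit, all lines stay together in a single block.
    """
    lines = list(lines)
    if lines_limit is None:
        return [lines]
    if lines_limit <= 0:
        raise ValueError("lines_limit must be a positive integer")
    # Consecutive slices; the final block holds the remaining lines.
    return [lines[i:i + lines_limit] for i in range(0, len(lines), lines_limit)]


# A 25-line file with a limit of 10 yields blocks of 10, 10 and 5 lines.
chunks = chunk_lines([f"line {n}" for n in range(1, 26)], 10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```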
All files are read using the character encoding specified by charset (UTF-8 by default).
All created documents have the features defined in constantDocumentFeatures. The single section of each document has the features defined in constantSectionFeatures.
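To make the resulting structure concrete, here is a minimal Python sketch of what one created document could look like. The `Document` and `Section` classes and the `read_document` function are hypothetical stand-ins, not AlvisNLP types; the sketch only reflects what this page states: the identifier is the absolute path, there is one section holding the file contents, and constant features are copied onto the document and its section.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Section:
    name: str
    contents: str
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    identifier: str
    sections: list
    features: dict = field(default_factory=dict)


def read_document(path, charset="UTF-8", section_name="contents",
                  document_features=None, section_features=None):
    """Build one document from one file (illustrative model only)."""
    absolute = Path(path).resolve()
    # The whole file is decoded with the given character set.
    text = absolute.read_text(encoding=charset)
    # One section per document, carrying the constant section features.
    section = Section(section_name, text, dict(section_features or {}))
    # The identifier is the absolute path; constant document features
    # are copied onto the document.
    return Document(str(absolute), [section], dict(document_features or {}))
```

Calling `read_document("notes.txt", document_features={"source": "manual"})` on a hypothetical file would give a document whose only section is named `contents` and whose `features` contain the `source` entry.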
Optional
Type: SourceStream
Path to the source directory or source file.
Optional
Type: Mapping
Constant features to add to each document created by this module.
Optional
Type: Mapping
Constant features to add to each section created by this module.
Optional
Type: Integer
Maximum number of lines per document.
Optional
Type: Integer
Maximum number of characters per document. No limit if not set.
Default value: false
Type: Boolean
Use the filename base name instead of the full path as document identifier.
Default value: UTF-8
Type: String
Character set of the input files.
Default value: contents
Type: String
Name of the single section containing the whole contents of a file.