Artifacts using WebArchive Commons (35)
2. Heritrix 3: 'commons' Subproject (utility Classes)
10 usages
org.archive.heritrix » heritrix-commons
Licenses: Apache, LGPL
The Archive Commons Code Libraries project contains general Java utility
libraries, as used by the Heritrix crawler and other projects.
Last Release on Sep 23, 2021
NLPA is a framework designed to operate in conjunction with BDP4J
(https://github.com/sing-group/bdp4j). It can extract text from
Twitter, YouTube comments, text files, raw email files (.eml), or WARC
(Web Archive) files. The extracted text can be preprocessed into a
Dataset using task (org.bdp4j.pipe.Pipe) definitions. This framework
incorporates more than 30 preprocessing tasks to transform the text.
Last Release on Jul 26, 2021