Artifacts using webarchive-commons version 1.1.8

The Archive Commons Code Libraries project contains general Java utility libraries, as used by the Heritrix crawler and other projects.
Last Release on Jul 27, 2022
WARC Indexer
Last Release on Nov 27, 2020
LOCKSS repository core infrastructure
Last Release on Jan 25, 2022
OpenWayback Core Java Classes
Last Release on Mar 19, 2021
Wayback CDX Server Core Java Classes
Last Release on Mar 19, 2021
Digipres Tika
Last Release on Nov 27, 2020
WARC Hadoop Recordreaders
Last Release on Nov 27, 2020
NLPA is a framework designed to operate in conjunction with BDP4J (https://github.com/sing-group/bdp4j). It extracts text from Twitter, YouTube comments, text files, raw email files (.eml), and WARC (Web Archive) files. The extracted text can be preprocessed into a Dataset using task (org.bdp4j.pipe.Pipe) definitions. The framework includes more than 30 preprocessing tasks for transforming the text.
Last Release on Jul 26, 2021
WARC Discovery
Last Release on Nov 28, 2020
NetarchiveSuite Wayback
Last Release on Aug 8, 2022