Where are the crawl files stored in the Heritrix web crawler?

I want to know where the crawled files are stored by the Heritrix crawler.

Thanks in advance.



1 answer


From the developer guide:

By default, Heritrix writes all of its crawl data to disk using the ARCWriterProcessor. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described here: Arc File Format. Heritrix writes ARC v1 files [1].

The ARC files are located in the arcs/ directory inside your crawl instance folder. You can change the location in the settings of the Heritrix web GUI.
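To check what a crawl has written so far, something like the following could work. It is only a minimal sketch: the function name `list_arc_files` and the job folder path are made up for illustration, and it assumes the default arcs/ location and files ending in .arc or .arc.gz. Adjust it if you changed the output directory in the settings.

```python
from pathlib import Path


def list_arc_files(job_dir):
    """Return the sorted ARC file paths found directly under <job_dir>/arcs.

    job_dir is the crawl instance folder (hypothetical; use your own path).
    """
    arcs_dir = Path(job_dir) / "arcs"
    if not arcs_dir.is_dir():
        # No arcs/ directory: nothing written yet, or the location was changed.
        return []
    return sorted(
        p for p in arcs_dir.iterdir()
        if p.is_file() and p.name.endswith((".arc", ".arc.gz"))
    )


if __name__ == "__main__":
    # Hypothetical crawl instance folder, for illustration only.
    for path in list_arc_files("jobs/my-crawl"):
        print(path.name, path.stat().st_size, "bytes")
```

If the list comes back empty, check the writer's output path in the job settings first.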



Instead of the default ARCWriterProcessor, you can use WARCWriterProcessor (WARC files), MirrorWriterProcessor (no container at all), or Kw3WriterProcessor. AFAIK, you can even configure multiple writers. Note that with MirrorWriterProcessor, not every file can be written to disk, depending on the file system you are writing to.

[1] Internet Archive ARC file format
