# Create a TAR archive with checksum
tar -cvzf lyxitsxlilix-archive-2024-04-11.tar.gz \
lyxitsxlilix/ \
site_normalized.json \
lyxitsxlilix.warc.gz
# Generate SHA‑256 checksum
sha256sum lyxitsxlilix-archive-2024-04-11.tar.gz > SHA256SUMS.txt
Upload the tar.gz and SHA256SUMS.txt to the chosen repository (e.g., DSpace) and fill out the metadata fields: title, creator, date, rights statement, and a note describing the permissions obtained.
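Before uploading, it can be worth confirming that the archive still matches the recorded checksum. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `verify_sums` are hypothetical, and it assumes `SHA256SUMS.txt` has the usual `sha256sum` layout (hex digest, whitespace, filename) and sits in the same directory as the archive.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_path):
    """Check every '<hex digest>  <filename>' line of a sha256sum-style file.

    Returns a dict mapping filename -> True (match) or False (mismatch).
    Filenames are resolved relative to the directory of the sums file.
    """
    sums_path = Path(sums_path)
    results = {}
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes '*' in binary mode
        actual = sha256_of_file(sums_path.parent / name)
        results[name] = actual == expected.lower()
    return results
```

Calling `verify_sums("SHA256SUMS.txt")` in the directory that holds the archive should report `True` for `lyxitsxlilix-archive-2024-04-11.tar.gz` if the file is unchanged since the checksum was generated.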
LyxitsXLilix Siterip (LSR) is proposed as a high-performance, modular site-ripping framework for researchers, archivists, and offline users. Its goals are faithful reproduction of pages, respect for site constraints, scalability, and privacy-preserving operation.
# Using the command‑line tool "webrecorder-cli"
webrecorder-cli capture \
--url https://lyxitsxlilix.org/ \
--output lyxitsxlilix.warc.gz \
--depth 5 \
--delay 2
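
The fixed two-second delay above is one way to respect site constraints; a crawler can also consult the site's robots.txt before each request. The sketch below uses Python's standard `urllib.robotparser`. The user-agent string `LSR-bot`, the helper name `build_policy`, and the sample rules are assumptions made for illustration, not part of any tool named here.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "LSR-bot"  # hypothetical user-agent string


def build_policy(robots_lines):
    """Parse robots.txt content (a list of lines) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# Sample rules for illustration only; a real crawl would fetch
# https://lyxitsxlilix.org/robots.txt and pass its lines instead.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

policy = build_policy(sample_rules)

# True if the rules allow this user agent to fetch the URL.
allowed = policy.can_fetch(USER_AGENT, "https://lyxitsxlilix.org/forum/")
blocked = policy.can_fetch(USER_AGENT, "https://lyxitsxlilix.org/private/x")

# Crawl-delay in seconds, or None when the site does not declare one.
delay = policy.crawl_delay(USER_AGENT)
```

A crawler built this way would skip URLs for which `can_fetch` returns `False` and sleep for at least `delay` seconds between requests, falling back to its own default when `delay` is `None`.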
| Phase | Toolset | Rationale |
|-------|----------|-----------|
| Discovery | scrapy + custom spider | Handles dynamic URL generation from API endpoints. |
| Rendering | Playwright (headless Chromium) | Captures JavaScript‑rendered content (e.g., forum pagination). |
| Asset Collection | wget with --mirror, --span-hosts, and --domains | Bulk download of static assets. --span-hosts lets wget follow links to other hosts, so --domains is needed to keep it within the allowed domains. |
| Metadata Harvest | Webrecorder (WARC export) | Guarantees a standards‑compliant archive of HTTP transactions. |
| Post‑Processing | warcio + custom Python scripts | Normalizes URLs, rewrites links to relative paths, removes dead links. |
| Validation | linkchecker + manual spot‑checks | Ensures the offline site is navigable. |
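
The link rewriting mentioned in the Post‑Processing row can be illustrated with a short sketch. It uses only the Python standard library; the function name `to_relative_link` is hypothetical, and it assumes that saved files mirror the URL path layout of the site and that links to other hosts should be left untouched.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_relative_link(page_url, href):
    """Rewrite href, as found on page_url, to a path relative to that page.

    Links that resolve to a different host (or to non-HTTP schemes such as
    mailto:) are returned unchanged.
    """
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, href))
    if target.netloc != page.netloc:
        return href

    # Directory that contains the current page, e.g. /forum/thread
    page_dir = posixpath.dirname(page.path) or "/"
    target_path = target.path or "/"

    rel = posixpath.relpath(target_path, start=page_dir)
    if target_path.endswith("/"):
        rel += "/"  # keep directory-style links recognizable

    # Carry over query string and fragment, if any.
    if target.query:
        rel += "?" + target.query
    if target.fragment:
        rel += "#" + target.fragment
    return rel
```

For a page at `https://lyxitsxlilix.org/forum/thread/1.html`, the absolute link `/assets/style.css` becomes `../../assets/style.css`, while a link to another host is returned as-is.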