Merge pull request #11 from Coleridge-Initiative/fix_pdf
working toward #6
ceteri authored Dec 24, 2019
2 parents fff1824 + 3acb7d9 commit f6648af
Showing 4 changed files with 50,593 additions and 43,616 deletions.
4 changes: 3 additions & 1 deletion DOWNLOAD.md
@@ -21,7 +21,7 @@ from the public S3 bucket.
Download the corpus PDFs and other resource files:

```
- python bin/download_resources.py
+ python bin/download_resources.py --logger errors.txt
```

The PDF files get stored in the `resources/pub/pdf` subdirectory.
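
The commit does not show how `bin/download_resources.py` handles the `--logger` option. As a rough sketch only, assuming the option names a file that receives error messages, an `argparse` and `logging` setup could look like the following. The names `build_parser`, `configure_logging`, and the logger name `download_resources_sketch` are hypothetical and not taken from the repository.

```python
import argparse
import logging
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the `--logger` flag itself appears in the commit.
    parser = argparse.ArgumentParser(description="download corpus resources (sketch)")
    parser.add_argument(
        "--logger",
        type=str,
        default=None,
        help="path of a file that receives error messages (assumed meaning)",
    )
    return parser


def configure_logging(log_path: Optional[str]) -> logging.Logger:
    # Route errors to the named file, or to stderr when no file is given.
    logger = logging.getLogger("download_resources_sketch")
    logger.setLevel(logging.ERROR)
    logger.handlers.clear()

    if log_path:
        handler: logging.Handler = logging.FileHandler(log_path)
    else:
        handler = logging.StreamHandler()

    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With this sketch, `build_parser().parse_args(["--logger", "errors.txt"])` yields `args.logger == "errors.txt"`, and a failed download could be recorded with `logger.error(...)` without interrupting the rest of the run.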
@@ -47,6 +47,8 @@ java -jar $SPJAR -o ./resources/pub/json ./resources/pub/pdf
That command will download multiple resources from the Allen AI public
datastore, which may take several minutes.

+ TODO: replace this step with use of a containerized `SPv2` server.


## Upload PDF and JSON files

13 changes: 7 additions & 6 deletions bin/download_resources.py
@@ -171,13 +171,14 @@ def enum_dat_resources (corpus: dict, output_path: Path, force_download: bool) -
          downloaded_before = e_id in downloaded_dat_id

          if force_download or not downloaded_before:
-             res_url = entity["foaf:page"]["@value"]
+             if "foaf:page" in entity:
+                 res_url = entity["foaf:page"]["@value"]

-             if res_url.startswith("http://example.com"):
-                 # ignore these placeholder URLs
-                 continue
-             else:
-                 todo.append(["unknown", e_id, res_url, dat_path])
+                 if res_url.startswith("http://example.com"):
+                     # ignore these placeholder URLs
+                     continue
+                 else:
+                     todo.append(["unknown", e_id, res_url, dat_path])

      return dat_path, todo
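
The change to `enum_dat_resources` guards the `foaf:page` lookup so that entities without that key are skipped instead of raising a `KeyError`. The standalone sketch below illustrates the same pattern outside the repository. The helper name `collect_todo`, the `"@id"` key, and the sample entities are assumptions made for illustration, not code from `bin/download_resources.py`.

```python
from pathlib import Path
from typing import List


def collect_todo(entities: List[dict], dat_path: Path) -> List[list]:
    # Hypothetical helper that mirrors the guarded branch in the diff.
    todo = []

    for entity in entities:
        e_id = entity.get("@id")  # assumed identifier key

        # Skip entities that have no "foaf:page" entry at all.
        if "foaf:page" not in entity:
            continue

        res_url = entity["foaf:page"]["@value"]

        # Ignore placeholder URLs, as the original code does.
        if res_url.startswith("http://example.com"):
            continue

        todo.append(["unknown", e_id, res_url, dat_path])

    return todo


if __name__ == "__main__":
    sample = [
        {"@id": "a", "foaf:page": {"@value": "http://example.com/placeholder"}},
        {"@id": "b"},  # no "foaf:page" key; previously this case raised KeyError
        {"@id": "c", "foaf:page": {"@value": "https://data.example.org/c.csv"}},
    ]
    for item in collect_todo(sample, Path("resources/dat")):
        print(item)
```

Only entity `c` ends up in the list here: `a` carries a placeholder URL and `b` has no `foaf:page` entry.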
