Test lists v1.5 #1720
Open
hellais wants to merge 11 commits into citizenlab:master from ooni:test-lists-v2
Commits:
8538d87 Update README (hellais)
016155a Add a MVP of the spec (hellais)
64a5e1a Update spec (hellais)
2aff021 Revert "Update README" (hellais)
3a5c39e Boilerplate for new codebase (hellais)
c10ef9d Refactor lint-lists script to run as part of CLI (hellais)
57b6cdd Refactor lint-lists script (hellais)
87f08a1 Add support for converting notes field (hellais)
289707d Enable notes field CLI flag (hellais)
0828cca Improve note fixing (hellais)
98ec26d Add support for quote fixing (hellais)
@@ -0,0 +1,102 @@
## Test Lists v1 data format

The goal of this section is to outline the current data format for the testing
lists.

Ideally we would also enrich this data format spec with notes on the existing
pain points and current limitations.

### v1 data format

The testing lists are broken down into CSV files, which are named as:
* `global.csv` for testing lists that apply to all countries
* `[country_code].csv` for country specific lists, where `country_code` is the
  lowercase
  [ISO3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) alpha
  2 country code. The only exception is the `cis` code, which is used
  for Commonwealth of Independent States nations.

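The naming rule above can be sketched as a small helper. This is only an illustration: `list_filename` is a hypothetical name, not a function from this PR, and the sketch assumes any two lowercase letters are an acceptable country code rather than checking the real ISO 3166 list.

```python
import re

# Assumption: any two-letter code is accepted here; a real implementation
# would check against the actual ISO 3166 alpha-2 list.
_ALPHA2 = re.compile(r"^[a-z]{2}$")


def list_filename(code: str) -> str:
    """Return the CSV filename for 'global', 'cis', or an alpha-2 code."""
    code = code.strip().lower()
    if code in ("global", "cis") or _ALPHA2.match(code):
        return f"{code}.csv"
    raise ValueError(f"not a valid list code: {code!r}")
```

For example, `list_filename("IT")` returns `it.csv`, while a longer string such as `"italy"` is rejected.
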
Each CSV file contains the following columns:

* `url` - Full URL of the resource, which must match the following regular expression:
```
re.compile(
    r'^(?:http)s?://' # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
    r'(?::\d+)?' # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)
```
* `category_code` - Category code (see current category codes)
* `category_description` - Description of the category
* `date_added` - [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date on which the URL was added to the list, in the format `YYYY-MM-DD`
* `source` - opaque string representing the name of the person that added it to the list
* `notes` - opaque string with notes about this URL

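As a rough illustration of how the column rules above could be checked, the sketch below validates the `url` and `date_added` fields of one row. The regular expression is copied from the spec; `check_row` and the error messages are hypothetical and are not taken from the lint script in this PR.

```python
import re
from datetime import datetime

# URL pattern copied from the spec above.
URL_RE = re.compile(
    r'^(?:http)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one CSV row."""
    problems = []
    if not URL_RE.match(row.get("url", "")):
        problems.append("url does not match the spec regex")
    try:
        # Parses YYYY-MM-DD and rejects impossible dates such as 2024-02-30.
        datetime.strptime(row.get("date_added", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date_added is not YYYY-MM-DD")
    return problems
```

Note that `strptime` also accepts non-padded values such as `2024-1-5`, so a stricter check would need an additional pattern match on the raw string.
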
### v1 category codes

* Alcohol & Drugs,ALDR
* Religion,REL
* Pornography,PORN
* Provocative Attire,PROV
* Political Criticism,POLR
* Human Rights Issues,HUMR
* Environment,ENV
* Terrorism and Militants,MILX
* Hate Speech,HATE
* News Media,NEWS
* Sex Education,XED
* Public Health,PUBH
* Gambling,GMB
* Anonymization and circumvention tools,ANON
* Online Dating,DATE
* Social Networking,GRP
* LGBT,LGBT
* File-sharing,FILE
* Hacking Tools,HACK
* Communication Tools,COMT
* Media sharing,MMED
* Hosting and Blogging Platforms,HOST
* Search Engines,SRCH
* Gaming,GAME
* Culture,CULTR
* Economics,ECON
* Government,GOVT
* E-commerce,COMM
* Control content,CTRL
* Intergovernmental Organizations,IGO
* Miscellaneous content,MISC

## v1.5 data format

The goal of the v1.5 data format is to come up with an incremental set of
changes to the list formats such that it's possible to relatively easily
backport changes from upstream while we work on fully migrating over to the new
format.

Ideally it would include only the addition of new columns, without any
dramatic changes, to minimize the likelihood of conflicts when it's merged from
upstream.

* `url` - Full URL of the resource
* `category_code` - Category code (see current category codes)
* `category_description` - Description of the category
* `date_added` - ISO timestamp of when it was added
* `source` - string representing the name of the person that added it
* `notes` - a JSON string representing metadata for the URL (see URL Meta below)

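To make the column layout concrete, here is a minimal sketch that writes and reads back one v1.5 row with Python's standard `csv` module. The row values are invented for illustration, and the quoting shown is simply what the `csv` module produces by default; the spec does not yet state which quoting format is expected.

```python
import csv
import io
import json

# Column order taken from the list above.
COLUMNS = ["url", "category_code", "category_description",
           "date_added", "source", "notes"]


def write_rows(rows) -> str:
    """Serialise rows to CSV text using the csv module's default quoting."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative row; none of these values come from the real lists.
row = {
    "url": "https://example.com/",
    "category_code": "NEWS",
    "category_description": "News Media",
    "date_added": "2024-01-31",
    "source": "example contributor",
    "notes": json.dumps({"notes": "illustrative entry"}),
}

text = write_rows([row])
parsed = list(csv.DictReader(io.StringIO(text)))
```

Because the JSON value contains double quotes, the `csv` module wraps the `notes` field in quotes and doubles the inner ones, and `csv.DictReader` restores the original JSON string on read.
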
### URL Meta

URL meta is a JSON encoded metadata column that expresses metadata related to
a URL that is relevant to analysts performing data analysis.

It should be extensible without needing to add new columns (adding or changing
columns has the potential of breaking parsers of CSV).

This field is optional and parsers should not expect it to be present or to
contain any of the specific keys defined below.

Defined keys:
* `notes`: value coming from the existing notes column
* `context_*`: values representing context that's specific to the URL

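A tolerant parser for this column might look like the sketch below. It assumes the first-byte check mentioned in the review comment on this PR (a value is treated as JSON only if it starts with `{`), and it falls back to wrapping legacy plain-text notes under the `notes` key. The function names and the `context_region` key used in the usage note are hypothetical.

```python
import json


def parse_url_meta(raw):
    """Return the notes column as a dict, accepting legacy plain text."""
    raw = (raw or "").strip()
    if not raw:
        # The field is optional, so a missing value yields empty metadata.
        return {}
    if raw.startswith("{"):
        try:
            meta = json.loads(raw)
        except json.JSONDecodeError:
            meta = None
        if isinstance(meta, dict):
            return meta
    # Anything else is treated as a v1-style free-text note.
    return {"notes": raw}


def context_values(meta):
    """Select only the context_* keys from parsed URL meta."""
    return {k: v for k, v in meta.items() if k.startswith("context_")}
```

For instance, `parse_url_meta("plain note")` returns `{"notes": "plain note"}`, and `context_values` applied to `{"notes": "x", "context_region": "y"}` keeps only the `context_region` entry.
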
@@ -0,0 +1,65 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "test-lists"
dynamic = ["version"]
description = ''
readme = "README.md"
requires-python = ">=3.8"
license = "MPL-2.0"
keywords = []
authors = [{ name = "Arturo Filastò", email = "[email protected]" }]
classifiers = [
  "Development Status :: 4 - Beta",
  "Programming Language :: Python",
  "Programming Language :: Python :: 3.8",
  "Programming Language :: Python :: 3.9",
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
  "Programming Language :: Python :: Implementation :: CPython",
  "Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = []

[project.urls]
Documentation = "https://github.com/ooni/test-lists#readme"
Issues = "https://github.com/ooni/test-lists/issues"
Source = "https://github.com/ooni/test-lists"

[tool.hatch.version]
path = "src/test_lists/__about__.py"

[tool.hatch.envs.default]
dependencies = ["coverage[toml]>=6.5", "pytest", "click"]
path = ".venv/"

[tool.hatch.envs.default.scripts]
lint-lists = "python -m test_lists.cli lint-lists {args}"
test = "pytest {args:tests}"
test-cov = "coverage run -m pytest {args:tests}"
cov-report = ["- coverage combine", "coverage report"]
cov = ["test-cov", "cov-report"]

[[tool.hatch.envs.all.matrix]]
python = ["3.8", "3.9", "3.10", "3.11", "3.12"]

[tool.hatch.envs.types]
dependencies = ["mypy>=1.0.0"]
[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive {args:src/test_lists tests}"

[tool.coverage.run]
source_pkgs = ["test_lists", "tests"]
branch = true
parallel = true
omit = ["src/test_lists/__about__.py"]

[tool.coverage.paths]
test_lists = ["src/test_lists", "*/test-lists/src/test_lists"]
tests = ["tests", "*/test-lists/tests"]

[tool.coverage.report]
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]
TODO: add a note about the quoting format and the fact that the JSON format is detected by peeking at the first byte, which should be `{`.