This is a language metadata toolkit to make it easier to work with a large variety of metadata from a single interface. Currently this metadata includes:
- Official identifiers: `iso-639-3`, `bcp-47`, Glottocode, Wikidata, Wikipedia
- Non-official identifiers: NLLB-style codes (`zho_Hans`, `zho_Hant`, etc.) [1]
- Geographic information
- Speaker information
- Writing systems
- Names
Planned:
- Phylogenetic information
- Linking with CLLD datasets
- Typological features: `grambank`, `wals`, `uriel`
- A graph-based view to traverse between languoids (family trees, geo areas, shared features etc.)
- Historical languages
Name: Qwanqwa is a phonetic spelling of 'ቋንቋ', which means language in Amharic; `qq` is nice and short to type.
In `qq`, language-like entities are referred to as Languoids, since they include dialects, macro-languages and language areas.
Not all languoids have coverage for all features.
Number of languoids: 7511
pip install git+https://github.com/WPoelman/qwanqwa
Important: `qq` makes a strict distinction between `None` (don't know) and `False` (it is not the case). Keep this in mind when checking boolean values for truthiness: if you're interested in missing values, for example, avoid `if not script.is_canonical:` and instead explicitly check `if script.is_canonical is None:`.
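The distinction matters because `None` and `False` are both falsy in Python. The following is a minimal sketch that uses a hypothetical stand-in class (`ScriptStub`, not the real `qq` script object) with made-up flag values, to show how a plain truthiness check merges the two cases while an explicit `is None` check keeps them apart:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptStub:
    """Hypothetical stand-in for a script record; not the real qq class."""
    iso_15924_code: str
    is_canonical: Optional[bool] = None


def canonical_state(script: ScriptStub) -> str:
    """Classify a tri-state flag without relying on truthiness."""
    if script.is_canonical is None:
        return "unknown"
    return "yes" if script.is_canonical else "no"


# Illustrative values only, not real language metadata.
scripts = [
    ScriptStub("Ethi", True),
    ScriptStub("Latn", False),
    ScriptStub("Arab"),  # flag never set, so it stays None
]

states = {s.iso_15924_code: canonical_state(s) for s in scripts}
# A plain truthiness check lumps "no" and "unknown" together.
falsy = [s.iso_15924_code for s in scripts if not s.is_canonical]

print(states)
print(falsy)
```

Here `falsy` contains both the `False` and the `None` entry, which is exactly the conflation the note above warns about.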
from qq import LanguageData, TagType
# Load from the pre-compiled database
ld = LanguageData.from_db()
# Access Languoid info using whatever official tag you have
nl1 = ld.get('nl', tag_type=TagType.BCP_47_CODE)
nl2 = ld.get('nld', tag_type=TagType.ISO_639_3_CODE)
# The `guess` method tries all known official tag types,
# be careful though, since this can give unexpected results.
nl3 = ld.guess('dut') # happens to be TagType.ISO_639_2_B
# In this case, these will give the same Languoid
assert nl1 == nl2 == nl3
am = ld.get('am') # Default tag_type is TagType.BCP_47_CODE
# Language identifiers
am.iso_639_3_code
> 'amh'
am.glottocode
> 'amha1245'
am.wikidata_id
> 'Q28244'
am.wikipedia_id
> 'am'
# Also some non-standard identifiers, often used in NLP research
am.nllb_style_codes_iso_639_3
> ['amh_Ethi']
am.nllb_style_codes_bcp_47
> ['am_Ethi']
# Names in different languages
am.endonym
> 'አማርኛ'
am.name_data['fr'].name
> 'amharique'
# English description
am.language_description.description
> 'Semitic language of Ethiopia'
# Endangerment status
am.endangerment_status
> <Endangerment.SAFE: 'SAFE'>
# Scripts
am.canonical_scripts
> [
    Script(
        iso_15924_code='ethi',
        is_canonical=True,
        is_historical=None,
        is_religious=None,
        is_for_transliteration=None,
        is_for_accessibility=None,
        is_in_widespread_use=None,
        has_official_status=None,
        has_symbolic_value=None,
        source='GOOGLE_RESEARCH',
    )
]
# Mapping between codes
dir(ld.tag_conversion)
> [
    'bcp_47_code2glottocode',
    'bcp_47_code2iso_639_2b_code',
    'bcp_47_code2iso_639_3_code',
    'bcp_47_code2wikidata_id',
    'bcp_47_code2wikipedia_id',
    'glottocode2bcp_47_code',
    'glottocode2iso_639_2b_code',
    'glottocode2iso_639_3_code',
    'glottocode2wikidata_id',
    'glottocode2wikipedia_id',
    'iso_639_2b_code2bcp_47_code',
    'iso_639_2b_code2glottocode',
    'iso_639_2b_code2iso_639_3_code',
    'iso_639_2b_code2wikidata_id',
    'iso_639_2b_code2wikipedia_id',
    'iso_639_3_code2bcp_47_code',
    'iso_639_3_code2glottocode',
    'iso_639_3_code2iso_639_2b_code',
    ...
]
... and more, here are some full examples.
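To illustrate why guessing a tag type can give unexpected results, here is a minimal sketch of a first-match lookup over several tag tables. The tables, their order, and the function names `guess` and `all_matches` are assumptions made for this example; they do not describe how `qq` stores data or resolves tags internally.

```python
from typing import List, Optional, Tuple

# Toy lookup tables, one per tag type. Hypothetical layout for illustration;
# dicts keep insertion order, so ISO_639_3 is tried before WIKIPEDIA.
TABLES = {
    "BCP_47": {"nl": "Dutch", "am": "Amharic"},
    "ISO_639_3": {"nld": "Dutch", "amh": "Amharic", "als": "Tosk Albanian"},
    "ISO_639_2B": {"dut": "Dutch"},
    "WIKIPEDIA": {"nl": "Dutch", "als": "Alemannic"},
}


def guess(tag: str) -> Optional[Tuple[str, str]]:
    """Return (tag_type, name) from the first table that contains the tag."""
    for tag_type, table in TABLES.items():
        if tag in table:
            return tag_type, table[tag]
    return None


def all_matches(tag: str) -> List[Tuple[str, str]]:
    """Return every (tag_type, name) pair, which exposes ambiguous tags."""
    return [(t, table[tag]) for t, table in TABLES.items() if tag in table]


print(guess("dut"))        # only one table knows this tag
print(guess("als"))        # first match wins, silently
print(all_matches("als"))  # the same string means two different things
```

The tag `als` is Tosk Albanian as an ISO 639-3 code but identifies the Alemannic edition as a Wikipedia code, so a first-match strategy returns whichever table happens to be checked first. Passing an explicit tag type avoids this.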
LinguaMeta:
- Paper: https://aclanthology.org/2024.lrec-main.921/
- Github: https://github.com/google-research/url-nlp/tree/main/linguameta
- License: CC BY-SA 4.0
Individual sources (taken from LinguaMeta README):
LinguaMeta ID | Source | License type | Link |
---|---|---|---|
CLDR | Unicode CLDR | non-standard | License |
GLOTTOLOG | Glottolog | CC BY 4.0 | Site homepage |
GOOGLE_RESEARCH | Language research conducted at Google | CC BY 4.0 | License |
IETF | IETF | CC BY 4.0 | License |
ISO_639 | SIL ISO 639 Registration Authority | non-standard | Terms of use |
WIKIDATA | Wikidata | CC0, CC BY-SA 3.0 | Copyright info |
WIKIPEDIA | Wikipedia | CC BY-SA 4.0 | Copyright info |
WIKTIONARY | Wiktionary | CC BY-SA 4.0 | Copyright info |
List of Wikipedias:
- Source: https://meta.wikimedia.org/wiki/List_of_Wikipedias
- License: CC BY-SA 4.0
GlotScript:
- Paper: https://aclanthology.org/2024.lrec-main.687/
- Github: https://github.com/cisnlp/GlotScript
- License: CC BY-SA 4.0
Individual sources (taken from GlotScript README):
- Wikipedia: Since Wikipedia writing system metadata is not easily redistributed, we provide our crawled version of the Writing System Text from Wikipedia in the sources folder.
- ScriptSource
- Unicode CLDR
- LangTag
- LREC_2800
- Omniglot
CC BY-SA 4.0
Footnotes
1. This is a combination of an `iso-639-3` or `bcp-47` language tag and an `iso-15924` script tag. This is similar to the first parts of an IETF tag, which, confusingly, can also be referred to as a `bcp-47` tag on its own. This is done in NLLB, for instance. This is not wrong, but because data in `qq` is based on LinguaMeta, which interprets just the first part of an IETF tag to be a `bcp-47` tag, we're sticking to LinguaMeta's interpretation of `bcp-47` and refer to the combined tag as `nllb_style`. The `iso-15924` parts of the `nllb_style` tags are based on GlotScript, excluding Braille.