API
dragonmapper.hanzi
Identification and transcription functions for Chinese characters.
Importing this module takes a moment because it loads CC-CEDICT and Unihan data into memory.
Identifying Chinese Characters
Identifying a string of text as Traditional or Simplified Chinese is a complicated task. This module takes a simple approach that only looks at individual characters and not word choice. When these functions identify a string of text as Simplified, they aren’t saying, “This string of Chinese is Simplified Chinese and not Traditional Chinese.” Instead, see it as identifying the string as compatible with the Simplified Chinese character system.
Note
These identification functions and constants are imported from the Hanzi Identifier library.
The following constants are used as return values for identify().
- dragonmapper.hanzi.UNKNOWN
Indicates that a string doesn’t contain any Chinese characters.
- dragonmapper.hanzi.TRAD
- dragonmapper.hanzi.TRADITIONAL
Indicates that a string contains Chinese characters that are only used in Traditional Chinese.
- dragonmapper.hanzi.SIMP
- dragonmapper.hanzi.SIMPLIFIED
Indicates that a string contains Chinese characters that are only used in Simplified Chinese.
- dragonmapper.hanzi.BOTH
Indicates that a string contains Chinese characters that are compatible with both Traditional and Simplified Chinese.
- dragonmapper.hanzi.MIXED
Indicates that a string contains both characters that are used only in Traditional Chinese and characters that are used only in Simplified Chinese.
- dragonmapper.hanzi.identify(s)
Identify what kind of Chinese characters a string contains.
s is a string to examine. The string’s Chinese characters are tested to see if they are compatible with the Traditional or Simplified character systems, compatible with both, or contain a mixture of Traditional and Simplified characters. The TRADITIONAL, SIMPLIFIED, BOTH, or MIXED constants are returned to indicate the string’s identity. If s contains no Chinese characters, then UNKNOWN is returned.
All characters in a string that aren’t found in the CC-CEDICT dictionary are ignored.
Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as SIMPLIFIED or BOTH depending on whether the characters are also Traditional characters. To make testing the identity of a string easier, the functions is_traditional(), is_simplified(), and has_chinese() are provided.
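As a rough illustration of the character-level approach described above, the sketch below classifies a string using two tiny, hypothetical character sets (TOY_TRADITIONAL and TOY_SIMPLIFIED). It is not the Hanzi Identifier implementation, and the constant values are placeholders; the real library draws its character data from CC-CEDICT.

```python
# Placeholder constants for this sketch only; the library defines its own values.
UNKNOWN, TRADITIONAL, SIMPLIFIED, BOTH, MIXED = range(5)

# Hypothetical toy character sets standing in for the real CC-CEDICT data.
TOY_TRADITIONAL = set("中國愛")
TOY_SIMPLIFIED = set("中国爱")


def toy_identify(s):
    """Classify s by looking only at individual characters."""
    # Characters outside both toy sets are ignored.
    chars = {ch for ch in s if ch in TOY_TRADITIONAL or ch in TOY_SIMPLIFIED}
    if not chars:
        return UNKNOWN
    fits_traditional = chars <= TOY_TRADITIONAL
    fits_simplified = chars <= TOY_SIMPLIFIED
    if fits_traditional and fits_simplified:
        return BOTH
    if fits_traditional:
        return TRADITIONAL
    if fits_simplified:
        return SIMPLIFIED
    return MIXED
```

With these toy sets, toy_identify('中') yields BOTH because the character belongs to both systems, while a string that mixes 國 and 国 yields MIXED.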
- dragonmapper.hanzi.has_chinese()
Check if a string has Chinese characters in it.
This is a faster version of:
>>> identify('foo') is not UNKNOWN
Transcribing Chinese Characters
The following functions transliterate Chinese characters into various transcription systems.
- dragonmapper.hanzi.to_pinyin(s, delimiter=' ', all_readings=False, container='[]', accented=True)
Convert a string’s Chinese characters to Pinyin readings.
s is a string containing Chinese characters. accented is a boolean value indicating whether to return accented or numbered Pinyin readings.
delimiter is the character used to indicate word boundaries in s. This is used to differentiate between words and characters so that a more accurate reading can be returned.
all_readings is a boolean value indicating whether or not to return all possible readings in the case of words/characters that have multiple readings. container is a two-character string that is used to enclose words/characters if all_readings is True. The default '[]' is used like this: '[READING1/READING2]'.
Characters not recognized as Chinese are left untouched.
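To make the roles of delimiter, all_readings, and container more concrete, here is a minimal sketch built on a hypothetical three-entry dictionary (TOY_READINGS). The function toy_to_pinyin is an illustration only: it does not reproduce the library's CC-CEDICT lookup or its exact output spacing, and it ignores the accented parameter.

```python
# Hypothetical toy dictionary mapping words/characters to candidate readings.
TOY_READINGS = {
    "便宜": ["piányi"],
    "便": ["biàn", "pián"],
    "宜": ["yí"],
}


def toy_to_pinyin(s, delimiter=" ", all_readings=False, container="[]"):
    """Look up whole words first, then fall back to single characters."""
    opening, closing = container  # container must be exactly two characters

    def render(readings):
        if all_readings:
            # Enclose every candidate reading, e.g. '[READING1/READING2]'.
            return opening + "/".join(readings) + closing
        return readings[0]

    pieces = []
    for word in s.split(delimiter):
        if word in TOY_READINGS:
            pieces.append(render(TOY_READINGS[word]))
        else:
            # Unknown word: transcribe character by character and leave
            # anything unrecognized untouched.
            pieces.append("".join(
                render(TOY_READINGS[ch]) if ch in TOY_READINGS else ch
                for ch in word
            ))
    return delimiter.join(pieces)
```

In this sketch, marking 便宜 as a single word gives the word-level reading, whereas separating the two characters with the delimiter makes each one fall back to its own reading.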
- dragonmapper.hanzi.to_zhuyin(s, delimiter=' ', all_readings=False, container='[]')
Convert a string’s Chinese characters to Zhuyin readings.
s is a string containing Chinese characters.
delimiter is the character used to indicate word boundaries in s. This is used to differentiate between words and characters so that a more accurate reading can be returned.
all_readings is a boolean value indicating whether or not to return all possible readings in the case of words/characters that have multiple readings. container is a two-character string that is used to enclose words/characters if all_readings is True. The default '[]' is used like this: '[READING1/READING2]'.
Characters not recognized as Chinese are left untouched.
- dragonmapper.hanzi.to_ipa(s, delimiter=' ', all_readings=False, container='[]')
Convert a string’s Chinese characters to IPA.
s is a string containing Chinese characters.
delimiter is the character used to indicate word boundaries in s. This is used to differentiate between words and characters so that a more accurate reading can be returned.
all_readings is a boolean value indicating whether or not to return all possible readings in the case of words/characters that have multiple readings. container is a two-character string that is used to enclose words/characters if all_readings is True. The default '[]' is used like this: '[READING1/READING2]'.
Characters not recognized as Chinese are left untouched.
dragonmapper.transcriptions
Identification and conversion functions for Chinese transcription systems.
Identifying Chinese Transcriptions
The following constants are used as return values for identify().
- dragonmapper.transcriptions.UNKNOWN
Indicates that a string isn’t a recognized Chinese transcription.
- dragonmapper.transcriptions.PINYIN
Indicates that a string’s content consists of Pinyin.
- dragonmapper.transcriptions.ZHUYIN
Indicates that a string’s content consists of Zhuyin (Bopomofo).
- dragonmapper.transcriptions.IPA
Indicates that a string’s content consists of the International Phonetic Alphabet (IPA).
- dragonmapper.transcriptions.identify(s)
Identify a given string’s transcription system.
s is the string to identify. The string is checked to see if its contents are valid Pinyin, Zhuyin, or IPA. The PINYIN, ZHUYIN, and IPA constants are returned to indicate the string’s identity. If s is not valid in any of these transcription systems, then UNKNOWN is returned.
When checking for valid Pinyin or Zhuyin, testing is done on a syllable level, not a character level. For example, a string composed of characters used in Pinyin will not necessarily identify as Pinyin; it must actually consist of valid Pinyin syllables. The same applies to Zhuyin.
When checking for IPA, testing is only done on a character level. In other words, a string just needs to consist of Chinese IPA characters in order to identify as IPA.
The following functions use identify(), but don’t require typing the names of the module-level constants.
The functions is_pinyin() and is_zhuyin() check for valid syllables. This takes more time than checking on the character level, but is more accurate. If you simply want to know whether a string is compatible with Pinyin or Zhuyin, but don’t need to know whether each syllable is actually valid, then use these functions:
- dragonmapper.transcriptions.is_pinyin_compatible(s)
Checks if s consists of Pinyin-compatible characters.
This does not check if s contains valid Pinyin syllables; for that see is_pinyin().
This function checks that all characters in s exist in zhon.pinyin.printable.
- dragonmapper.transcriptions.is_zhuyin_compatible(s)
Checks if s consists of Zhuyin-compatible characters.
This does not check if s contains valid Zhuyin syllables; for that see is_zhuyin().
Besides Zhuyin characters and tone marks, spaces are also accepted. This function checks that all characters in s exist in zhon.zhuyin.characters, zhon.zhuyin.marks, or ' '.
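The difference between a character-level compatibility check and a syllable-level validity check can be sketched as follows. The names TOY_LETTERS, TOY_SYLLABLES, toy_is_compatible, and toy_is_valid are hypothetical, the syllable inventory is deliberately tiny, and the sketch assumes toneless, space-separated syllables; the library itself relies on the zhon character sets mentioned above.

```python
# Hypothetical toy data: a real check would need the full syllable inventory.
TOY_LETTERS = set("abcdefghijklmnopqrstuvwxyz ")
TOY_SYLLABLES = {"ni", "hao", "zhong", "guo", "wen"}


def toy_is_compatible(s):
    """Character level: every character belongs to the allowed set."""
    return bool(s) and all(ch in TOY_LETTERS for ch in s.lower())


def toy_is_valid(s):
    """Syllable level: every space-separated chunk is a known syllable."""
    chunks = s.lower().split()
    return bool(chunks) and all(chunk in TOY_SYLLABLES for chunk in chunks)
```

A string such as 'nq hxo' passes the character-level check because it uses only allowed letters, yet it fails the syllable-level check because neither chunk is a real syllable.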
Converting Chinese Transcriptions
Converting between the various transcription systems is fairly simple. A few things to note:
When converting from Pinyin to Zhuyin or IPA, spaces are added between each syllable because Zhuyin and IPA are not meant to be read in sentence format. They don’t have the equivalent of Pinyin’s apostrophe to separate certain syllables.
When converting from Pinyin to Zhuyin or IPA, all syllable-separating apostrophes are removed. Those that don’t separate syllables (like quotation marks) are left untouched.
In Pinyin, 'v' is considered another way to write 'ü'. The *_to_pinyin functions all output that vowel as 'ü'.
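Two of the notes above lend themselves to a small sketch: treating 'v' as 'ü', and telling syllable-separating apostrophes apart from quotation marks. The helpers normalize_v and split_on_apostrophes are hypothetical and use a simple heuristic (an apostrophe with a letter on both sides separates syllables); the library's own rules may differ.

```python
import re

# An apostrophe with a letter on both sides is treated as a syllable separator.
# [^\W\d_] matches any Unicode letter, including accented Pinyin vowels.
_SEPARATING_APOSTROPHE = re.compile(r"(?<=[^\W\d_])'(?=[^\W\d_])")


def normalize_v(pinyin):
    """Write the vowel typed as 'v' in its 'ü' form."""
    return pinyin.replace("v", "ü").replace("V", "Ü")


def split_on_apostrophes(pinyin):
    """Replace syllable-separating apostrophes with spaces.

    Apostrophes used as quotation marks have no letter on one side,
    so they are left untouched.
    """
    return _SEPARATING_APOSTROPHE.sub(" ", pinyin)
```

Under this heuristic, split_on_apostrophes("xi'an") separates the two syllables with a space, while a quoted phrase such as "'ni hao'" is returned unchanged.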
These conversion functions come in two flavors: functions that convert individual syllables and functions that convert sentence-style text. If you only have individual syllables to convert, it’s quicker to use the *_syllable_to_* functions, which assume the input is a single valid syllable.
Syllable Conversion
- dragonmapper.transcriptions.numbered_syllable_to_accented(s)
Convert numbered Pinyin syllable s to an accented Pinyin syllable.
It implements the following algorithm to determine where to place tone marks:
- If the syllable has an ‘a’, ‘e’, or ‘o’ (in that order), put the tone mark over that vowel.
- Otherwise, put the tone mark on the last vowel.
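The two placement rules above can be sketched in a few lines. This is an independent illustration, not the library's code: toy_numbered_to_accented is a hypothetical name, and treating any digit other than 1 to 4 as a neutral tone (no mark) is an assumption of the sketch.

```python
import unicodedata

# Combining diacritics for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}
VOWELS = "aeiouü"


def toy_numbered_to_accented(syllable):
    """Place a tone mark on a numbered Pinyin syllable such as 'hao3'."""
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone number, nothing to convert
    base, tone = syllable[:-1], syllable[-1]
    if tone not in TONE_MARKS:
        return base  # assumption: any other digit marks a neutral tone
    lower = base.lower()
    # Rule 1: prefer 'a', then 'e', then 'o'.
    index = next((lower.index(v) for v in "aeo" if v in lower), -1)
    if index == -1:
        # Rule 2: otherwise use the last vowel in the syllable.
        index = max(lower.rfind(v) for v in VOWELS)
    if index == -1:
        return base  # no vowel available to carry the mark
    # Compose the vowel and the combining mark into a single character.
    marked = unicodedata.normalize("NFC", base[index] + TONE_MARKS[tone])
    return base[:index] + marked + base[index + 1:]
```

For example, 'hao3' becomes 'hǎo' under the first rule, and 'gui4' becomes 'guì' under the second because it contains no 'a', 'e', or 'o'.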
- dragonmapper.transcriptions.accented_syllable_to_numbered(s)
Convert accented Pinyin syllable s to a numbered Pinyin syllable.
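One way to picture the reverse direction is to decompose the syllable with Unicode normalization, pull out the tone diacritic, and append the matching number. The sketch below is an assumption-laden illustration rather than the library's implementation: toy_accented_to_numbered is a hypothetical name, and writing the neutral tone as 5 is a common convention that the library may or may not share.

```python
import unicodedata

# Combining diacritics mapped back to tone numbers 1-4.
MARK_TO_TONE = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def toy_accented_to_numbered(syllable):
    """Strip the tone mark from an accented syllable and append its number."""
    tone = "5"  # assumption: unmarked syllables are neutral tone, written as 5
    kept = []
    # NFD splits each accented vowel into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in MARK_TO_TONE:
            tone = MARK_TO_TONE[ch]
        else:
            kept.append(ch)
    # NFC recombines what is left, so the diaeresis of 'ü' stays attached.
    return unicodedata.normalize("NFC", "".join(kept)) + tone
```

Because the diaeresis of 'ü' is not one of the four tone marks, a syllable like 'lǜ' keeps its 'ü' and comes out as 'lü4'.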
- dragonmapper.transcriptions.pinyin_syllable_to_zhuyin(s)
Convert Pinyin syllable s to a Zhuyin syllable.
- dragonmapper.transcriptions.pinyin_syllable_to_ipa(s)
Convert Pinyin syllable s to an IPA syllable.
- dragonmapper.transcriptions.zhuyin_syllable_to_pinyin(s, accented=True)
Convert Zhuyin syllable s to a Pinyin syllable.
If accented is True, diacritics are added to the Pinyin syllable. If it’s False, numbers are used to indicate the syllable’s tone.
- dragonmapper.transcriptions.zhuyin_syllable_to_ipa(s)
Convert Zhuyin syllable s to an IPA syllable.
Sentence-Style Conversion
- dragonmapper.transcriptions.numbered_to_accented(s)
Convert all numbered Pinyin syllables in s to accented Pinyin.
- dragonmapper.transcriptions.accented_to_numbered(s)
Convert all accented Pinyin syllables in s to numbered Pinyin.
- dragonmapper.transcriptions.pinyin_to_zhuyin(s)
Convert all Pinyin syllables in s to Zhuyin.
Spaces are added between connected syllables and syllable-separating apostrophes are removed.
- dragonmapper.transcriptions.pinyin_to_ipa(s)
Convert all Pinyin syllables in s to IPA.
Spaces are added between connected syllables and syllable-separating apostrophes are removed.
- dragonmapper.transcriptions.zhuyin_to_pinyin(s, accented=True)
Convert all Zhuyin syllables in s to Pinyin.
If accented is True, diacritics are added to the Pinyin syllables. If it’s False, numbers are used to indicate tone.
Combined: Identification and Conversion
These functions take an unidentified transcription string and identify it, then convert it into the target transcription system. If you know you’ll be identifying your strings before you convert them, these can save you a few lines of code.