PyNLPIR API
pynlpir
Provides an easy-to-use Python interface to NLPIR/ICTCLAS.
The functions below are not as extensive as the full set of functions exported by NLPIR (for that, see pynlpir.nlpir). A few design choices have been made with these functions as well, e.g. they have been renamed and their output is formatted differently.
The functions in this module all assume input is either a string or bytes encoded using the encoding specified when open() is called. These functions return strings.
After importing this module, you must call open() in order to initialize the NLPIR API. When you’re done using the NLPIR API, call close() to exit the API.
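The open-then-close requirement above fits naturally into a try/finally block. The following is a minimal sketch, assuming a hypothetical helper named segment_once that receives the module (or any object exposing open(), segment(), and close()) as its api argument. The helper is illustrative and is not part of PyNLPIR.

```python
def segment_once(api, text, **segment_kwargs):
    """Open the API, segment `text`, and close the API even if segmenting fails.

    `api` is expected to expose open(), segment(), and close(), as the
    pynlpir module does. This helper is hypothetical, not part of PyNLPIR.
    """
    api.open()
    try:
        return api.segment(text, **segment_kwargs)
    finally:
        # Runs on success and on error, so the API is always exited.
        api.close()


# Possible usage, assuming PyNLPIR is installed and licensed:
#     import pynlpir
#     tokens = segment_once(pynlpir, '我们是学生', pos_tagging=False)
```

Passing the module in as an argument keeps the helper easy to exercise with a stand-in object when NLPIR itself is not available.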
- pynlpir.open(data_dir=nlpir.PACKAGE_DIR, encoding=ENCODING, encoding_errors=ENCODING_ERRORS, license_code=None)
Initializes the NLPIR API.
This calls the function Init().
- Parameters:
data_dir (str) – The absolute path to the directory that has NLPIR’s Data directory (defaults to pynlpir.nlpir.PACKAGE_DIR).
encoding (str) – The encoding that the Chinese source text will be in (defaults to 'utf_8'). Possible values include 'gbk', 'utf_8', and 'big5'.
encoding_errors (str) – The desired encoding error handling scheme. Possible values include 'strict', 'ignore', and 'replace'. The default error handler is 'strict', meaning that encoding errors raise ValueError (or a more codec-specific subclass, such as UnicodeEncodeError).
license_code (str) – The license code that should be used when initializing NLPIR. This is generally only used by commercial users.
- Raises:
RuntimeError – The NLPIR API failed to initialize. Sometimes, NLPIR leaves an error log in the current working directory or NLPIR’s Data directory that provides more detailed messages (but this isn’t always the case).
LicenseError – The NLPIR license appears to be invalid or expired.
- pynlpir.close()
Exits the NLPIR API and frees allocated memory. This calls the function Exit().
- pynlpir.segment(s, pos_tagging=True, pos_names='parent', pos_english=True)
Segment Chinese text s using NLPIR.
The segmented tokens are returned as a list. Each item of the list is a string if pos_tagging is False, e.g. ['我们', '是', ...]. If pos_tagging is True, then each item is a (token, pos) tuple, e.g. [('我们', 'pronoun'), ('是', 'verb'), ...].
If pos_tagging is True and a segmented word is not recognized by NLPIR’s part of speech tagger, then the part of speech code/name will be returned as None (e.g. a space returns as (' ', None)).
This uses the function ParagraphProcess() to segment s.
- Parameters:
s – The Chinese text to segment. s should be a string or UTF-8 encoded bytes.
pos_tagging (bool) – Whether or not to include part of speech tagging (defaults to True).
pos_names (str or None) – What type of part of speech names to return. This argument is only used if pos_tagging is True. None means only the original NLPIR part of speech code will be returned. Other than None, pos_names may be one of 'parent', 'child', 'all', or 'raw'. Defaults to 'parent'. 'parent' indicates that only the most generic name should be used, e.g. 'noun' for 'nsf'. 'child' indicates that the most specific name should be used, e.g. 'transcribed toponym' for 'nsf'. 'all' indicates that all names should be used, e.g. 'noun:toponym:transcribed toponym' for 'nsf'. 'raw' indicates that original names should be used.
pos_english (bool) – Whether to use English or Chinese for the part of speech names, e.g. 'conjunction' or '连词'. Defaults to True. This is only used if pos_tagging is True.
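Because the item type of the returned list depends on pos_tagging, consuming code may want to normalize it. The sketch below uses a hypothetical helper, words_only (not part of PyNLPIR), that accepts either shape and returns only the token text. The sample lists reuse the example values shown above.

```python
def words_only(segments):
    """Return just the token text from either output shape of segment().

    Items may be plain strings (pos_tagging=False) or (token, pos) tuples
    (pos_tagging=True); pos may be None for unrecognized tokens.
    """
    return [item[0] if isinstance(item, tuple) else item for item in segments]


# Sample values in the two documented shapes.
tagged = [('我们', 'pronoun'), ('是', 'verb'), (' ', None)]
plain = ['我们', '是']

tagged_words = words_only(tagged)
plain_words = words_only(plain)
```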
- pynlpir.get_key_words(s, max_words=50, weighted=False)
Determines key words in Chinese text s.
The key words are returned in a list. If weighted is True, then each list item is a tuple: (word, weight), where weight is a float. If it’s False, then each list item is a string.
This uses the function GetKeyWords() to determine the key words in s.
- Parameters:
s – The Chinese text to analyze. s should be a string or UTF-8 encoded bytes.
max_words (int) – The maximum number of key words to find (defaults to 50).
weighted (bool) – Whether or not to return the key words’ weights (defaults to False).
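When weighted is True, the (word, weight) tuples can be ranked with ordinary list tools. A small sketch follows, assuming a hypothetical helper top_key_words; the sample words and weights are invented for illustration and are not real NLPIR output.

```python
def top_key_words(key_words, limit=3):
    """Return the `limit` heaviest words from a list of (word, weight) tuples."""
    ranked = sorted(key_words, key=lambda pair: pair[1], reverse=True)
    return [word for word, _weight in ranked[:limit]]


# Invented sample data in the documented (word, weight) shape.
sample = [('中国', 2.5), ('经济', 4.0), ('发展', 3.1)]
heaviest = top_key_words(sample, limit=2)
```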
pynlpir.nlpir
This module uses ctypes to provide a Python API to NLPIR. Other than argument names used in this documentation, the functions are left the same as they are in NLPIR.
When this module is imported, the NLPIR library is imported and the functions listed below are exported by a ctypes.CDLL instance.
There is a less extensive, easier-to-use NLPIR interface directly in the pynlpir module.
Init() must be called before any other NLPIR functions can be called. After using the API, you can call Exit() to exit the API and free up allocated memory.
- pynlpir.nlpir.PACKAGE_DIR
The absolute path to this package (used by NLPIR to find its Data directory).
- pynlpir.nlpir.LIB_DIR
The absolute path to this package’s lib directory.
- pynlpir.nlpir.libNLPIR
A ctypes.CDLL instance for the NLPIR API library.
- pynlpir.nlpir.GBK_CODE = 0
NLPIR’s GBK encoding constant.
- pynlpir.nlpir.UTF8_CODE = 1
NLPIR’s UTF-8 encoding constant.
- pynlpir.nlpir.BIG5_CODE = 2
NLPIR’s BIG5 encoding constant.
- pynlpir.nlpir.GBK_FANTI_CODE = 3
NLPIR’s GBK (Traditional Chinese) encoding constant.
- pynlpir.nlpir.ICT_POS_MAP_SECOND = 0
ICTCLAS part of speech constant #2.
- pynlpir.nlpir.ICT_POS_MAP_FIRST = 1
ICTCLAS part of speech constant #1.
- pynlpir.nlpir.PKU_POS_MAP_SECOND = 2
PKU part of speech constant #2.
- pynlpir.nlpir.PKU_POS_MAP_FIRST = 3
PKU part of speech constant #1.
- class pynlpir.nlpir.ResultT
The NLPIR result_t structure. Inherits from ctypes.Structure.
- start
The start position of the word in the source Chinese text string.
- length
The detected word’s length.
- sPOS
A string representing the word’s part of speech.
- word_type
Whether the word is found in the user’s dictionary.
- weight
The weight of the detected word.
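For readers unfamiliar with ctypes.Structure, the sketch below shows how a structure with the attributes listed above could be declared. It is an illustration only: the class name ResultSketch, the field types, the 40-byte tag buffer, and the field order are all assumptions, and the real result_t layout may contain additional fields.

```python
import ctypes


class ResultSketch(ctypes.Structure):
    """Illustrative stand-in for result_t; types and order are assumed."""
    _fields_ = [
        ("start", ctypes.c_int),
        ("length", ctypes.c_int),
        ("sPOS", ctypes.c_char * 40),  # assumed fixed-size buffer for the tag
        ("word_type", ctypes.c_int),
        ("weight", ctypes.c_int),
    ]


# Fields can be set by keyword and read back as ordinary attributes.
r = ResultSketch(start=0, length=6, sPOS=b"rr", word_type=0, weight=1)
tag = r.sPOS  # a c_char array field reads back as bytes
```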
- pynlpir.nlpir.get_func(name, argtypes=None, restype=c_int, lib=libNLPIR)
Retrieves the corresponding NLPIR function.
- Parameters:
name (str) – The name of the NLPIR function to get.
argtypes (list) – A list of ctypes data types that correspond to the function’s argument types.
restype – A ctypes data type that corresponds to the function’s return type (only needed if the return type isn’t ctypes.c_int).
lib – A ctypes.CDLL instance for the NLPIR API library where the function will be retrieved from (defaults to libNLPIR).
- Returns:
The exported function. It can be called like any other Python callable.
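The general ctypes pattern behind a lookup like this is: fetch the symbol by name, then declare its argument and return types. The sketch below is not the library's implementation. It uses a hypothetical get_func_sketch and substitutes ctypes.pythonapi for libNLPIR so that it runs without NLPIR installed.

```python
import ctypes


def get_func_sketch(name, argtypes=None, restype=ctypes.c_int,
                    lib=ctypes.pythonapi):
    """Look up an exported function by name and declare its signature.

    `lib` defaults to ctypes.pythonapi purely as a stand-in library.
    """
    func = getattr(lib, name)  # look up the exported symbol by name
    if argtypes is not None:
        func.argtypes = argtypes
    func.restype = restype
    return func


# Py_IsInitialized takes no arguments and returns a C int.
is_initialized = get_func_sketch("Py_IsInitialized", argtypes=[])
status = is_initialized()
```

Declaring argtypes and restype up front lets ctypes convert arguments and results instead of treating everything as a plain C int.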
- pynlpir.nlpir.Init(data_dir, encoding=GBK_CODE, license_code=None)
Initializes the NLPIR API. This must be called before any other NLPIR functions will work.
- Parameters:
data_dir (str) – The path to the NLPIR data folder’s parent folder. PACKAGE_DIR can be used for this.
encoding (int) – Which encoding NLPIR should expect. GBK_CODE, UTF8_CODE, BIG5_CODE, and GBK_FANTI_CODE should be used for this argument.
license_code (str) – A license code for unlimited usage. Most users shouldn’t need to use this.
- Returns:
Whether or not the function executed successfully.
- Return type:
bool
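Init() takes an integer encoding constant, whereas pynlpir.open() takes an encoding name. One way such a translation could be written is sketched below. The helper encoding_constant is hypothetical, the constant values are copied from the list above, and the actual mapping used by pynlpir.open() is not shown in this document.

```python
import codecs

# Values copied from the constants documented above.
GBK_CODE, UTF8_CODE, BIG5_CODE = 0, 1, 2

# Key the table by Python's canonical codec names so aliases also match.
_ENCODING_CODES = {
    codecs.lookup("gbk").name: GBK_CODE,
    codecs.lookup("utf_8").name: UTF8_CODE,
    codecs.lookup("big5").name: BIG5_CODE,
}


def encoding_constant(encoding):
    """Map an encoding name such as 'utf_8' to an NLPIR encoding constant."""
    normalized = codecs.lookup(encoding).name  # LookupError if unknown codec
    try:
        return _ENCODING_CODES[normalized]
    except KeyError:
        raise ValueError("unsupported encoding: %r" % (encoding,)) from None
```

Normalizing through codecs.lookup() means 'utf_8', 'utf-8', and 'UTF-8' all resolve to the same constant.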
- pynlpir.nlpir.Exit()
Exits the NLPIR API and frees allocated memory.
- Returns:
Whether or not the function executed successfully.
- Return type:
bool
- pynlpir.nlpir.ParagraphProcess(s, pos_tagging=True)
Segments a string of Chinese text (encoded using the encoding specified when Init() was called).
- Parameters:
s (str) – The Chinese text to process.
pos_tagging (bool) – Whether or not to return part of speech tags with the segmented words.
- Returns:
The segmented words.
- Return type:
str
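Since the result is a single string, callers usually need to split it into tokens. The sketch below assumes, without confirmation from this document, that tagged output consists of space-separated word/tag tokens such as 我们/rr. The helper split_tagged is hypothetical, and it drops whitespace tokens for simplicity.

```python
def split_tagged(output):
    """Split 'word/tag' tokens into (word, tag) tuples.

    Assumes tokens are separated by spaces and that the tag follows the
    last '/'. Tokens without a usable tag yield (token, None).
    """
    pairs = []
    for token in output.strip().split(" "):
        if not token:
            continue  # skip empty pieces produced by repeated spaces
        word, sep, tag = token.rpartition("/")
        if sep and word:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


# Invented sample string in the assumed format.
pairs = split_tagged("我们/rr 是/vshi  学生/n")
```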
- pynlpir.nlpir.ParagraphProcessA(s, size_pointer, user_dict=True)
Segments a string of Chinese text (encoded using the encoding specified when Init() was called).
Here is an example of how to use this function:

    import ctypes
    from pynlpir.nlpir import ParagraphProcessA, ResultT

    size = ctypes.c_int()  # receives the number of results
    result = ParagraphProcessA(s, ctypes.byref(size), False)
    # Treat the returned pointer as an array of ResultT structures.
    result_t_vector = ctypes.cast(result, ctypes.POINTER(ResultT))
    words = []
    for i in range(size.value):
        r = result_t_vector[i]
        word = s[r.start:r.start + r.length]
        words.append((word, r.sPOS))

- Parameters:
s (str) – The Chinese text to process.
size_pointer – A pointer to a ctypes.c_int that will be set to the result vector’s size.
user_dict (bool) – Whether or not to use the user dictionary.
- Returns:
A pointer to the result vector. Each result in the result vector is an instance of ResultT.
- pynlpir.nlpir.FileProcess(source_filename, result_filename, pos_tagging=True)
Processes a text file.
- Parameters:
source_filename (str) – The name of the file that contains the source text.
result_filename (str) – The name of the file where the results should be written.
pos_tagging (bool) – Whether or not to include part of speech tags in the output.
- Returns:
If the function executed successfully, the processing speed is returned (float). Otherwise, 0 is returned.
- pynlpir.nlpir.ImportUserDict(filename)
Imports a user-defined dictionary from a text file.
- Parameters:
filename (str) – The filename of the user’s dictionary file.
- Returns:
The number of lexical entries successfully imported.
- Return type:
int
- pynlpir.nlpir.AddUserWord(word)
Adds a word to the user’s dictionary.
- Parameters:
word (str) – The word to add to the dictionary.
- Returns:
1 if the word was added successfully, otherwise 0.
- Return type:
int
- pynlpir.nlpir.SaveTheUsrDic()
Writes the user’s dictionary to disk.
- Returns:
1 if the dictionary was saved successfully, otherwise 0.
- Return type:
int
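AddUserWord() and SaveTheUsrDic() both report success as 1 and failure as 0, so a caller can check each step. The sketch below uses a hypothetical helper, add_words_and_save, that takes the module (or a stand-in with the same two functions) as api. Words are passed through unchanged; whether they must be encoded first is not covered here.

```python
def add_words_and_save(api, words):
    """Add each word to the user dictionary, then write it to disk.

    Returns the words that AddUserWord() rejected (return value other
    than 1). Raises RuntimeError if SaveTheUsrDic() does not return 1.
    """
    rejected = [word for word in words if api.AddUserWord(word) != 1]
    if api.SaveTheUsrDic() != 1:
        raise RuntimeError("the user dictionary could not be saved")
    return rejected


# Possible usage, assuming the NLPIR API has already been initialized:
#     from pynlpir import nlpir
#     failed = add_words_and_save(nlpir, words)
```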
- pynlpir.nlpir.DelUsrWord(word)
Deletes a word from the user’s dictionary.
- Parameters:
word (str) – The word to delete.
- Returns:
-1 if the word doesn’t exist in the dictionary. Otherwise, the pointer to the word deleted.
- Return type:
int
- pynlpir.nlpir.GetKeyWords(s, max_words=50, weighted=False)
Extracts key words from a string of Chinese text.
- Parameters:
s (str) – The Chinese text to process.
max_words (int) – The maximum number of key words to return.
weighted (bool) – Whether or not the key words’ weights are returned.
- Returns:
The key words.
- Return type:
str
- pynlpir.nlpir.GetFileKeyWords(filename, max_words=50, weighted=False)
Extracts key words from Chinese text in a file.
- Parameters:
filename (str) – The file to process.
max_words (int) – The maximum number of key words to return.
weighted (bool) – Whether or not the key words’ weights are returned.
- Returns:
The key words.
- Return type:
str
- pynlpir.nlpir.GetNewWords(s, max_words=50, weighted=False)
Extracts new words from a string of Chinese text.
- Parameters:
s (str) – The Chinese text to process.
max_words (int) – The maximum number of new words to return.
weighted (bool) – Whether or not the new words’ weights are returned.
- Returns:
The new words.
- Return type:
str
- pynlpir.nlpir.GetFileNewWords(filename, max_words=50, weighted=False)
Extracts new words from Chinese text in a file.
- Parameters:
filename (str) – The file to process.
max_words (int) – The maximum number of new words to return.
weighted (bool) – Whether or not the new words’ weights are returned.
- Returns:
The new words.
- Return type:
str
- pynlpir.nlpir.FingerPrint(s)
Extracts a fingerprint from a string of Chinese text.
- Parameters:
s (str) – The Chinese text to process.
- Returns:
The fingerprint of the content. 0 if the function failed.
- pynlpir.nlpir.SetPOSmap(pos_map)
Selects which part of speech map to use.
- Parameters:
pos_map (int) – The part of speech map that should be used. This should be one of ICT_POS_MAP_FIRST, ICT_POS_MAP_SECOND, PKU_POS_MAP_FIRST, or PKU_POS_MAP_SECOND.
- Returns:
0 if the function failed, otherwise 1.
- Return type:
int
- pynlpir.nlpir.NWI_Start()
Initializes new word identification.
- Returns:
True if the function succeeded; False if it failed.
- Return type:
bool
- pynlpir.nlpir.NWI_AddFile(filename)
Adds the words in a text file.
- Parameters:
filename (str) – The text file’s filename.
- Returns:
True if the function succeeded; False if it failed.
- Return type:
bool
- pynlpir.nlpir.NWI_AddMem(filename)
Increases the allotted memory for new word identification.
- Parameters:
filename (str) – NLPIR’s documentation is unclear on what this argument is for.
- Returns:
True if the function succeeded; False if it failed.
- Return type:
bool
- pynlpir.nlpir.NWI_Complete()
Terminates new word identification. Frees up memory and resources.
- Returns:
True if the function succeeded; False if it failed.
- Return type:
bool
- pynlpir.nlpir.NWI_GetResult(weight)
Returns the new word identification results.
- Parameters:
weight (bool) – Whether or not to include word weights in the results.
- Returns:
The identified words.
- Return type:
str
- pynlpir.nlpir.NWI_Results2UserDict()
Adds the newly identified words to the user dictionary.
This function should only be called after NWI_Complete() is called.
If you want to save the user dictionary, consider running SaveTheUsrDic().
- Returns:
1 if the function succeeded; 0 if it failed.
- Return type:
int
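The NWI_* functions are meant to be called in sequence. The sketch below strings them together in one plausible order: start, add files, complete, fetch results, and optionally copy the results to the user dictionary. The helper identify_new_words is hypothetical, and calling NWI_GetResult() after NWI_Complete() is an assumption; only the ordering of NWI_Results2UserDict() after NWI_Complete() is stated above.

```python
def identify_new_words(api, filenames, weighted=False, add_to_user_dict=False):
    """Run one new word identification pass over `filenames`.

    `api` is expected to expose the NWI_* functions documented above.
    Returns whatever NWI_GetResult() returns.
    """
    if not api.NWI_Start():
        raise RuntimeError("NWI_Start() failed")
    for filename in filenames:
        if not api.NWI_AddFile(filename):
            raise RuntimeError("NWI_AddFile() failed for %r" % (filename,))
    if not api.NWI_Complete():
        raise RuntimeError("NWI_Complete() failed")
    result = api.NWI_GetResult(weighted)  # assumed to be valid after Complete
    if add_to_user_dict:
        # Documented as allowed only after NWI_Complete() has been called.
        api.NWI_Results2UserDict()
    return result
```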
pynlpir.pos_map
Part of speech mapping constants and functions for NLPIR/ICTCLAS.
This module is used by pynlpir to format segmented words for output.
- pynlpir.pos_map.POS_MAP
A dictionary that maps part of speech codes returned by NLPIR to human-readable names (English and Chinese).
- pynlpir.pos_map.get_pos_name(code, name='parent', english=True)
Gets the part of speech name for code.
- Parameters:
code (str) – The part of speech code to look up, e.g. 'nsf'.
name (str) – Which part of speech name to include in the output. Must be one of 'parent', 'child', 'all', or 'raw'. Defaults to 'parent'. 'parent' indicates that only the most generic name should be used, e.g. 'noun' for 'nsf'. 'child' indicates that the most specific name should be used, e.g. 'transcribed toponym' for 'nsf'. 'all' indicates that all names should be used, e.g. ('noun', 'toponym', 'transcribed toponym') for 'nsf'. 'raw' indicates the original names.
english (bool) – Whether to return an English or Chinese name.
- Returns:
str if name is 'parent' or 'child'. tuple if name is 'all'. None if the part of speech code is not recognized.
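The relationship between 'parent', 'child', and 'all' can be illustrated with a one-entry table built from the 'nsf' example above. The sketch below is not the library's implementation: pos_name_sketch and its table are hypothetical, the real POS_MAP layout is not shown in this document, and 'raw' and Chinese names are left out.

```python
# Hypothetical table: names ordered from most generic to most specific.
_NAMES = {'nsf': ('noun', 'toponym', 'transcribed toponym')}


def pos_name_sketch(code, name='parent'):
    """Return the generic, specific, or full name chain for `code`."""
    names = _NAMES.get(code)
    if names is None:
        return None  # unrecognized part of speech code
    if name == 'parent':
        return names[0]
    if name == 'child':
        return names[-1]
    if name == 'all':
        return names
    raise ValueError("name must be 'parent', 'child', or 'all'")
```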