Enhance nickname processing #122
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
Created on Thu Mar 18 04:54:12 2021 | ||
|
||
@author: New User | ||
""" | ||
import re | ||
|
||
class TupleManager(dict): | ||
''' | ||
A dictionary with dot.notation access. Subclass of ``dict``. Makes the tuple constants | ||
more friendly. | ||
''' | ||
def __getattr__(self, attr): | ||
return self.get(attr) | ||
__setattr__= dict.__setitem__ | ||
__delattr__= dict.__delitem__ | ||
|
||
def __getstate__(self): | ||
return dict(self) | ||
|
||
def __setstate__(self, state): | ||
self.__init__(state) | ||
|
||
def __reduce__(self): | ||
return (TupleManager, (), self.__getstate__()) | ||
|
||
REGEXES = [ | ||
("spaces", re.compile(r"\s+", re.U)), | ||
("word", re.compile(r"(\w|\.)+", re.U)), | ||
("mac", re.compile(r'^(ma?c)(\w{2,})', re.I | re.U)), | ||
("initial", re.compile(r'^(\w\.|[A-Z])?$', re.U)), | ||
("quoted_word", re.compile(r'(?<!\w)\'([^\s]*?)\'(?!\w)', re.U), 'nickname'), | ||
("double_quotes", re.compile(r'\"(.*?)\"', re.U), 'nickname'), | ||
("parenthesis", re.compile(r'\((.*?)\)', re.U), 'nickname'), | ||
#("quoted_word", re.compile(r'(?<!\w)\'([^\s]*?)\'(?!\w)', re.U)), | ||
#("double_quotes", re.compile(r'\"(.*?)\"', re.U)), | ||
#("parenthesis", re.compile(r'\((.*?)\)', re.U)), | ||
("roman_numeral", re.compile(r'^(X|IX|IV|V?I{0,3})$', re.I | re.U)), | ||
("no_vowels",re.compile(r'^[^aeyiuo]+$', re.I | re.U)), | ||
("period_not_at_end",re.compile(r'.*\..+$', re.I | re.U)), | ||
("phd", re.compile(r'\s(ph\.?\s+d\.?)', re.I | re.U)), | ||
] | ||
|
||
r = TupleManager(tpl[:2] for tpl in REGEXES) | ||
nn_TM = TupleManager(tpl[:2] for tpl in REGEXES if tpl[-1] == 'nickname') | ||
nn = [tpl[1] for tpl in REGEXES if tpl[-1] == 'nickname'] | ||
|
||
rgx = re.compile(r"(?!\w)‘([^’]*?)’(?!\w)", re.U) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ | |
from nameparser.config import CONSTANTS | ||
from nameparser.config import Constants | ||
from nameparser.config import DEFAULT_ENCODING | ||
from nameparser.config.regexes import REGEXES | ||
|
||
ENCODING = 'utf-8' | ||
|
||
|
@@ -70,7 +71,7 @@ class HumanName(object): | |
_members = ['title','first','middle','last','suffix','nickname'] | ||
unparsable = True | ||
_full_name = '' | ||
|
||
def __init__(self, full_name="", constants=CONSTANTS, encoding=DEFAULT_ENCODING, | ||
string_format=None): | ||
self.C = constants | ||
|
@@ -79,7 +80,17 @@ def __init__(self, full_name="", constants=CONSTANTS, encoding=DEFAULT_ENCODING, | |
|
||
self.encoding = encoding | ||
self.string_format = string_format or self.C.string_format | ||
self._nickname_regexes = [tpl[1] | ||
for tpl in REGEXES | ||
if isinstance(tpl[-1], str) | ||
and 'nickname' in tpl[-1] | ||
] | ||
# full_name setter triggers the parse | ||
#======================================================== | ||
#IMPORTANT NOTE: | ||
# The followint statement must be the last one in the | ||
Should be "The following statement...", also could combine with the existing comment:
I agree. These two statements can be combined. I was 'bitten' by that when I started to change the code. I wanted to add text to really draw the attention of future collaborators.
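The ordering constraint discussed in this thread can be shown with a minimal sketch. `NameSketch` is a hypothetical stand-in, not the library's `HumanName`; it only assumes, as the comment in the diff says, that assigning `full_name` triggers the parse, so every attribute the parser reads has to exist before that assignment.

```python
import re


class NameSketch:
    """Hypothetical stand-in for a class whose full_name setter triggers parsing."""

    def __init__(self, full_name=""):
        # Everything parse() depends on is set up first.
        self._nickname_regexes = [re.compile(r'"(.*?)"')]
        self.nickname_list = []
        # Must stay last: the setter below calls parse() immediately.
        self.full_name = full_name

    @property
    def full_name(self):
        return self._full_name

    @full_name.setter
    def full_name(self, value):
        self._full_name = value
        self.parse()

    def parse(self):
        # Raises AttributeError if _nickname_regexes has not been assigned yet.
        for regex in self._nickname_regexes:
            self.nickname_list += regex.findall(self._full_name)
            self._full_name = regex.sub(" ", self._full_name)


name = NameSketch('Ben "Benny" Franklin')
```

Moving the `self.full_name = full_name` line above either of the two assignments before it would make the constructor fail inside `parse()`, which is the trap the warning comment is meant to flag.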
||
# __init__ function | ||
#======================================================== | ||
self.full_name = full_name | ||
|
||
def __iter__(self): | ||
|
@@ -419,18 +430,14 @@ def parse_nicknames(self): | |
white space to allow for quotes in names like O'Connor and Kawai'ae'a. | ||
Double quotes and parenthesis can span white space. | ||
|
||
Loops through 3 :py:data:`~nameparser.config.regexes.REGEXES`; | ||
`quoted_word`, `double_quotes` and `parenthesis`. | ||
Loops through :py:data:`~nameparser.config.regexes.REGEXES` with | ||
label/tag like "nickname" | ||
""" | ||
|
||
re_quoted_word = self.C.regexes.quoted_word | ||
re_double_quotes = self.C.regexes.double_quotes | ||
re_parenthesis = self.C.regexes.parenthesis | ||
|
||
for _re in (re_quoted_word, re_double_quotes, re_parenthesis): | ||
for _re in self._nickname_regexes: | ||
if _re.search(self._full_name): | ||
self.nickname_list += [x for x in _re.findall(self._full_name)] | ||
self._full_name = _re.sub('', self._full_name) | ||
self._full_name = _re.sub(' ', self._full_name) | ||
|
||
def squash_emoji(self): | ||
""" | ||
|
I think this should use the local variable `regexes` to preserve the ability to pass it as an attribute to a new instance (not that anyone is doing that): `([tpl[:2] for tpl in regexes])`.
What is the slice doing here? It's not clear to me.
I added a tag/label to some of the tuples. The slice returns the first two items in the tuples, omitting the tag/label data. The TupleManager object can still be used in the code. The regexes variable in the constants() is no longer a set object, just a list of tuples. I did this to preserve the order of the regex patterns.
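The tagging scheme described in this reply can be sketched as follows. A plain `dict` stands in for `TupleManager`, and the three patterns are a shortened, illustrative subset of the list in the diff: two-item tuples are untagged, and a third item of `'nickname'` marks a pattern for nickname parsing.

```python
import re

# Illustrative subset: (name, pattern) or (name, pattern, tag).
REGEXES = [
    ("spaces", re.compile(r"\s+")),
    ("double_quotes", re.compile(r'"(.*?)"'), "nickname"),
    ("parenthesis", re.compile(r"\((.*?)\)"), "nickname"),
]

# tpl[:2] drops the optional tag, leaving (name, pattern) pairs for a mapping.
by_name = dict(tpl[:2] for tpl in REGEXES)

# Only three-item tuples carry a tag, so check the length before reading it.
nickname_patterns = [
    tpl[1] for tpl in REGEXES if len(tpl) == 3 and tpl[2] == "nickname"
]

print(list(by_name))
print([p.pattern for p in nickname_patterns])
```

Because `REGEXES` is a list rather than a set, `nickname_patterns` keeps the order in which the patterns were declared, which is the reason given above for the change of container.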
OK. But if someone tries to instantiate passing a TupleManager to `regexes`, it will have no effect because you are using the global variable instead of the local one. Need to replace `REGEXES` with `regexes`.
Ex.: `name = HumanName(regexes=myTupleManager)` would fail. (I guess I should have some tests for those instantiation attributes.)
I don't understand. The list of compiled regex patterns for nicknames is different than the regexes that is fed into tuplemanager. I thought I'd left the tuplemanager-based regexes alone. I might have gotten a little confused by variables/functions with the same name. I'll take another look at it.
Some clarification would be helpful.
It's a simple mistake of using the module constant instead of the attribute passed to the class' init function.
change:
self.regexes = TupleManager([tpl[:2] for tpl in REGEXES])
to
self.regexes = TupleManager([tpl[:2] for tpl in regexes])
Here's a test that should pass but will fail with your code above.
(I edited the last line of my test to fix the equals test)
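For readers following the thread, the failure mode can be reproduced in isolation. This is a sketch, not the reviewer's test (which is not reproduced on this page): `BuggyConstantsSketch` and `ConstantsSketch` are hypothetical classes that only mimic the shape of a constructor accepting a `regexes` argument, and a plain `dict` stands in for `TupleManager`.

```python
import re

REGEXES = [("spaces", re.compile(r"\s+"))]  # module-level default


class BuggyConstantsSketch:
    def __init__(self, regexes=REGEXES):
        # Bug: ignores the argument and always reads the module constant.
        self.regexes = dict(tpl[:2] for tpl in REGEXES)


class ConstantsSketch:
    def __init__(self, regexes=REGEXES):
        # Fix: build the mapping from what the caller passed in.
        self.regexes = dict(tpl[:2] for tpl in regexes)


custom = [("digits", re.compile(r"\d+"))]
buggy = BuggyConstantsSketch(regexes=custom)
fixed = ConstantsSketch(regexes=custom)
```

In the buggy version the `custom` list is silently ignored, so `buggy.regexes` still holds only the default `spaces` entry, while `fixed.regexes` holds the caller's `digits` entry.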