Enhance nickname processing #122

Open
wants to merge 4 commits into
base: master
Changes from 1 commit
2 changes: 1 addition & 1 deletion nameparser/config/__init__.py
@@ -231,7 +231,7 @@ def __init__(self,
self.first_name_titles = SetManager(first_name_titles)
self.conjunctions = SetManager(conjunctions)
self.capitalization_exceptions = TupleManager(capitalization_exceptions)
- self.regexes = TupleManager(regexes)
+ self.regexes = TupleManager([tpl[:2] for tpl in REGEXES])
Owner

I think this should use the local variable regexes, i.e. [tpl[:2] for tpl in regexes], to preserve the ability to pass it as an argument to a new instance (not that anyone is doing that).

What is the slice doing here? It's not clear to me.

Author

I added a tag/label to some of the tuples. The slice returns the first two items in the tuples, omitting the tag/label data. The TupleManager object can still be used in the code. The regexes variable in the constants() is no longer a set object, just a list of tuples. I did this to preserve the order of the regex patterns.
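
To make the slice concrete, here is a minimal sketch. It assumes a hypothetical two-entry list named SAMPLE_REGEXES (not the PR's actual list) and a plain dict in place of TupleManager; it only shows that tpl[:2] drops the optional tag and that a list keeps its declared order.

```python
import re

# Hypothetical sample; the real list lives in nameparser/config/regexes.py.
SAMPLE_REGEXES = [
    ("spaces", re.compile(r"\s+", re.U)),                            # untagged 2-tuple
    ("quote_ASCII", re.compile(r'"(\w[^"]*?)"', re.U), "nickname"),  # tagged 3-tuple
]

# tpl[:2] keeps (name, pattern) and drops the optional tag, so both
# 2-tuples and 3-tuples become valid (key, value) pairs for dict().
pairs = [tpl[:2] for tpl in SAMPLE_REGEXES]
by_name = dict(pairs)

# A list keeps declaration order, which a set would not guarantee.
names_in_order = [tpl[0] for tpl in SAMPLE_REGEXES]
```

Because TupleManager subclasses dict, the same list of pairs should be usable as its constructor argument.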

Owner

Ok. But if someone tries to instantiate while passing a TupleManager to regexes, it will have no effect, because you are using the global variable instead of the local one. You need to replace REGEXES with regexes.

ex: name = HumanName(regexes=myTupleManager) would fail. (I guess I should have some tests for those instantiation attributes.)

Author

I don't understand. The list of compiled regex patterns for nicknames is different from the regexes that are fed into TupleManager. I thought I'd left the TupleManager-based regexes alone. I might have gotten a little confused by variables/functions with the same name. I'll take another look at it.

Some clarification would be helpful.

Owner

@derek73, May 18, 2021

It's a simple mistake of using the module constant instead of the argument passed to the class's __init__ function.

change:
self.regexes = TupleManager([tpl[:2] for tpl in REGEXES])
to
self.regexes = TupleManager([tpl[:2] for tpl in regexes])

Here's a test that should pass but will fail with your code above.

    def test_custom_regex_constant(self):
        t = {'test': 'test'}
        c = Constants(regexes=t)
        self.assertEqual(c.regexes, t)
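
The difference between the two lines can be reproduced outside the library. The following is only a sketch with stand-in names: DEFAULT_REGEXES, BuggyConstants and FixedConstants are hypothetical, and a plain dict replaces TupleManager. It is not the nameparser implementation.

```python
# Stand-in for the module-level constant (hypothetical values).
DEFAULT_REGEXES = [("spaces", r"\s+"), ("word", r"\w+", "nickname")]


class BuggyConstants:
    def __init__(self, regexes=DEFAULT_REGEXES):
        # Bug: reads the module constant, so the argument is ignored.
        self.regexes = dict(tpl[:2] for tpl in DEFAULT_REGEXES)


class FixedConstants:
    def __init__(self, regexes=DEFAULT_REGEXES):
        # Fix: reads the argument that the caller passed in.
        self.regexes = dict(tpl[:2] for tpl in regexes)


custom = [("test", "test")]
buggy = BuggyConstants(regexes=custom)
fixed = FixedConstants(regexes=custom)
```

One caveat, assuming the slice is applied as written: a dict argument would be iterated by key, so this sketch passes a list of tuples rather than a dict.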

Owner

(I edited the last line of my test to fix the equals test)

self._pst = None

@property
27 changes: 22 additions & 5 deletions nameparser/config/regexes.py
@@ -18,20 +18,37 @@
'[\u2600-\u26FF\u2700-\u27BF])+',
re.UNICODE)

- REGEXES = set([
+ REGEXES = [
("spaces", re.compile(r"\s+", re.U)),
("word", re.compile(r"(\w|\.)+", re.U)),
("mac", re.compile(r'^(ma?c)(\w{2,})', re.I | re.U)),
("initial", re.compile(r'^(\w\.|[A-Z])?$', re.U)),
- ("quoted_word", re.compile(r'(?<!\w)\'([^\s]*?)\'(?!\w)', re.U)),
- ("double_quotes", re.compile(r'\"(.*?)\"', re.U)),
- ("parenthesis", re.compile(r'\((.*?)\)', re.U)),
+ ("double_apostrophe_ASCII", re.compile(r"(?!\w)''(\w[^']*?)''(?!\w)", re.U), 'nickname'),
+ ("smart_quote", re.compile(r"(?!\w)“(\w[^”]*?)”(?!\w)", re.U), 'nickname'),
+ ("smart_single_quote", re.compile(r"(?!\w)‘(\w[^’]*?)’(?!\w)", re.U), 'nickname'),
+ ("grave_accent", re.compile(r'(?!\w)`(\w[^`]*?)`(?!\w)', re.U), 'nickname'),
+ ("grave_acute", re.compile(r'(?!\w)`(\w[^´]*?)´(?!\w)', re.U), 'nickname'),
+ ("apostrophe_ASCII", re.compile(r"(?!\w)'(\w[^']*?)'(?!\w)", re.U), 'nickname'),
+ ("quote_ASCII", re.compile(r'(?!\w)"(\w[^"]*?)"(?!\w)', re.U), 'nickname'),
+ ("parenthesis", re.compile(r'(?!\w)\((\w[^)]*?)\)(?!\w)', re.U), 'nickname'),
("roman_numeral", re.compile(r'^(X|IX|IV|V?I{0,3})$', re.I | re.U)),
("no_vowels",re.compile(r'^[^aeyiuo]+$', re.I | re.U)),
("period_not_at_end",re.compile(r'.*\..+$', re.I | re.U)),
("emoji",re_emoji),
("phd", re.compile(r'\s(ph\.?\s+d\.?)', re.I | re.U)),
- ])
+ ]
"""
All regular expressions used by the parser are precompiled and stored in the config.
+
+ REGEX tuple positions are:
+ [0] - name of the pattern, used in code as named attribute
+ [1] - compiled pattern
+ [2] - (optional) label/tag of the pattern, used in code for
+ filtering patterns
+
+ All nickname patterns should follow this pattern:
+ (?!\w)leading_delim([^trailing_delim]*?)trailing_delim(?!\w)
+
+ Nicknames are assume to be delimited by non-word characters.
+
"""
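
As a rough illustration of the tuple layout and the pattern template in the docstring above, the sketch below fills the template for two delimiter pairs and filters by the tag. The helper make_nickname_pattern and the SAMPLE list are hypothetical names introduced for this example; the template is copied as documented.

```python
import re


def make_nickname_pattern(lead, trail):
    """Hypothetical helper: fill the documented template for one delimiter pair."""
    lead, trail = re.escape(lead), re.escape(trail)
    return re.compile(r"(?!\w)%s([^%s]*?)%s(?!\w)" % (lead, trail, trail), re.U)


SAMPLE = [
    ("spaces", re.compile(r"\s+", re.U)),                            # no tag
    ("parenthesis", make_nickname_pattern("(", ")"), "nickname"),    # tagged
    ("grave_accent", make_nickname_pattern("`", "`"), "nickname"),   # tagged
]

# Position [2] is optional, so check the tuple length before reading the tag.
nickname_patterns = [tpl[1] for tpl in SAMPLE if len(tpl) > 2 and tpl[2] == "nickname"]

# Collect every captured nickname, pattern by pattern, in list order.
found = [m for rx in nickname_patterns for m in rx.findall("Robert (Bob) `Bobby` Smith")]
```

Checking the length first is one way to tolerate untagged tuples; the PR itself tests the last element of each tuple instead.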
49 changes: 49 additions & 0 deletions nameparser/config/testREGEXES.py
@@ -0,0 +1,49 @@
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 18 04:54:12 2021

@author: New User
"""
import re

class TupleManager(dict):
'''
A dictionary with dot.notation access. Subclass of ``dict``. Makes the tuple constants
more friendly.
'''
def __getattr__(self, attr):
return self.get(attr)
__setattr__= dict.__setitem__
__delattr__= dict.__delitem__

def __getstate__(self):
return dict(self)

def __setstate__(self, state):
self.__init__(state)

def __reduce__(self):
return (TupleManager, (), self.__getstate__())

REGEXES = [
("spaces", re.compile(r"\s+", re.U)),
("word", re.compile(r"(\w|\.)+", re.U)),
("mac", re.compile(r'^(ma?c)(\w{2,})', re.I | re.U)),
("initial", re.compile(r'^(\w\.|[A-Z])?$', re.U)),
("quoted_word", re.compile(r'(?<!\w)\'([^\s]*?)\'(?!\w)', re.U), 'nickname'),
("double_quotes", re.compile(r'\"(.*?)\"', re.U), 'nickname'),
("parenthesis", re.compile(r'\((.*?)\)', re.U), 'nickname'),
#("quoted_word", re.compile(r'(?<!\w)\'([^\s]*?)\'(?!\w)', re.U)),
#("double_quotes", re.compile(r'\"(.*?)\"', re.U)),
#("parenthesis", re.compile(r'\((.*?)\)', re.U)),
("roman_numeral", re.compile(r'^(X|IX|IV|V?I{0,3})$', re.I | re.U)),
("no_vowels",re.compile(r'^[^aeyiuo]+$', re.I | re.U)),
("period_not_at_end",re.compile(r'.*\..+$', re.I | re.U)),
("phd", re.compile(r'\s(ph\.?\s+d\.?)', re.I | re.U)),
]

r = TupleManager(tpl[:2] for tpl in REGEXES)
nn_TM = TupleManager(tpl[:2] for tpl in REGEXES if tpl[-1] == 'nickname')
nn = [tpl[1] for tpl in REGEXES if tpl[-1] == 'nickname']

rgx = re.compile(r"(?!\w)‘([^’]*?)’(?!\w)", re.U)
25 changes: 16 additions & 9 deletions nameparser/parser.py
@@ -12,6 +12,7 @@
from nameparser.config import CONSTANTS
from nameparser.config import Constants
from nameparser.config import DEFAULT_ENCODING
+ from nameparser.config.regexes import REGEXES

ENCODING = 'utf-8'

@@ -70,7 +71,7 @@ class HumanName(object):
_members = ['title','first','middle','last','suffix','nickname']
unparsable = True
_full_name = ''

def __init__(self, full_name="", constants=CONSTANTS, encoding=DEFAULT_ENCODING,
string_format=None):
self.C = constants
@@ -79,7 +80,17 @@ def __init__(self, full_name="", constants=CONSTANTS, encoding=DEFAULT_ENCODING,

self.encoding = encoding
self.string_format = string_format or self.C.string_format
+ self._nickname_regexes = [tpl[1]
+ for tpl in REGEXES
+ if isinstance(tpl[-1], str)
+ and 'nickname' in tpl[-1]
+ ]
# full_name setter triggers the parse
+ #========================================================
+ #IMPORTANT NOTE:
+ # The followint statement must be the last one in the
Owner

Should be "The following statement...", also could combine with the existing comment:

The following statement must be the last line in __init__ because it triggers the parse using :py:func:`full_name.setter`.

Author

I agree. These two statements can be combined. I was 'bitten' by that when I started to change the code. I wanted to add text to really draw the attention of future collaborators.
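
The ordering pitfall discussed in this thread can be shown with a small stand-in class. MiniName is hypothetical and far simpler than HumanName; it only illustrates that when a property setter runs the parse, every attribute the parse reads has to exist before the assignment.

```python
class MiniName:
    """Hypothetical stand-in: assigning full_name triggers a parse."""

    def __init__(self, full_name="", set_patterns_first=True):
        if set_patterns_first:
            self._patterns = ["nickname"]   # needed by _parse()
        self.full_name = full_name          # setter runs _parse() immediately
        if not set_patterns_first:
            self._patterns = ["nickname"]   # too late: _parse() already ran

    @property
    def full_name(self):
        return self._full_name

    @full_name.setter
    def full_name(self, value):
        self._full_name = value
        self._parse()

    def _parse(self):
        # Reads an attribute that __init__ must have set beforehand.
        self.parsed_with = list(self._patterns)


ok = MiniName("Jane Doe")
try:
    MiniName("Jane Doe", set_patterns_first=False)
    failed = False
except AttributeError:
    failed = True
```

In this sketch the second construction raises AttributeError, which is the kind of surprise a prominent comment in __init__ is meant to prevent.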

+ # __init__ function
+ #========================================================
self.full_name = full_name

def __iter__(self):
@@ -419,18 +430,14 @@ def parse_nicknames(self):
white space to allow for quotes in names like O'Connor and Kawai'ae'a.
Double quotes and parenthesis can span white space.

- Loops through 3 :py:data:`~nameparser.config.regexes.REGEXES`;
- `quoted_word`, `double_quotes` and `parenthesis`.
+ Loops through :py:data:`~nameparser.config.regexes.REGEXES` with
+ label/tag like "nickname"
"""

- re_quoted_word = self.C.regexes.quoted_word
- re_double_quotes = self.C.regexes.double_quotes
- re_parenthesis = self.C.regexes.parenthesis
-
- for _re in (re_quoted_word, re_double_quotes, re_parenthesis):
+ for _re in self._nickname_regexes:
if _re.search(self._full_name):
self.nickname_list += [x for x in _re.findall(self._full_name)]
- self._full_name = _re.sub('', self._full_name)
+ self._full_name = _re.sub(' ', self._full_name)

def squash_emoji(self):
"""
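
The change in parse_nicknames from an empty replacement to a single space can matter when a nickname sits directly between two name parts. A small sketch, using a simplified parenthesis pattern and a made-up input string (both are assumptions for illustration, not taken from the PR's tests):

```python
import re

paren = re.compile(r"\((.*?)\)", re.U)   # simplified pattern for illustration
name = "Benjamin(Ben)Franklin"           # made-up input without surrounding spaces

nicknames = paren.findall(name)
joined = paren.sub("", name)     # neighbouring parts run together
spaced = paren.sub(" ", name)    # neighbouring parts stay separate tokens
```

With a space as the replacement, a later whitespace split still sees two name parts; any doubled spaces this leaves behind would need collapsing elsewhere.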