BabelStrike

Purpose

This tool normalizes a list of full names and generates possible usernames from it. The list may include names written in multiple (non-English) languages, a common problem with scraped employee name lists (e.g. from LinkedIn).

BabelStrike takes a list of full names as input and performs either or both of the following: 1. Romanization of non-English names (based on per-language alphabet transliteration maps), 2. name-to-username conversions based on various naming convention rules.

The Romanization feature currently supports Greek, Spanish and Polish. Contributions of new language classes are welcome; see the Contributions section for how it's done.

Video Presentation

https://www.youtube.com/watch?v=550S6oAYfDo

Name to Username Conversion Rules

Table of rules for generating usernames:

{f} = first letter of Name, {fi} = first two letters of Name ...
{l} = first letter of Lastname, {la} = first two letters of Lastname ...

The rules can also be applied automatically to the reversed version of each full name by using [-a].


{firstname}{lastname}	{f}{l}	{lastname}{f}	{f}{la}	{firstname}
{firstname}.{lastname}	{f}.{l}	{lastname}.{f}	{f}.{la}	{lastname}
{firstname}_{lastname}	{f}_{l}	{lastname}_{f}	{f}_{la}
{firstname}-{lastname}	{f}-{l}	{lastname}-{f}	{f}-{la}
{firstname} {lastname}	{f} {l}	{lastname} {f}	{f} {la}
{f}{lastname}	{fi}{lastname}	{lastname}{fi}	{la}{f}
{f}.{lastname}	{fi}.{lastname}	{lastname}.{fi}	{la}.{f}
{f}_{lastname}	{fi}_{lastname}	{lastname}_{fi}	{la}_{f}
{f}-{lastname}	{fi}-{lastname}	{lastname}-{fi}	{la}-{f}
{f} {lastname}	{fi} {lastname}	{lastname} {fi}	{la} {f}
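
As an illustration of how the placeholders in the table above expand, here is a minimal sketch. The function names `expand_rule` and `generate_usernames` are hypothetical and not taken from the BabelStrike source, and the sketch assumes each name consists of exactly two whitespace-separated parts. The `auto_reverse` argument mimics what [-a] is described as doing: the same rules are applied again with first and last name swapped.

```python
def expand_rule(rule, firstname, lastname):
    """Fill one naming-convention rule with parts of a first and last name."""
    # Hypothetical token table mirroring the legend above the rules table.
    tokens = {
        "{firstname}": firstname,
        "{lastname}": lastname,
        "{fi}": firstname[:2],
        "{la}": lastname[:2],
        "{f}": firstname[:1],
        "{l}": lastname[:1],
    }
    for token, value in tokens.items():
        rule = rule.replace(token, value)
    return rule


def generate_usernames(full_name, rules, auto_reverse=False):
    """Apply every rule to a two-part name, optionally also to its reverse."""
    first, last = full_name.split()
    pairs = [(first, last)]
    if auto_reverse:
        pairs.append((last, first))  # treat the name as "lastname firstname" too
    usernames = []
    for fn, ln in pairs:
        for rule in rules:
            candidate = expand_rule(rule, fn, ln)
            if candidate not in usernames:  # keep order, drop duplicates
                usernames.append(candidate)
    return usernames


rules = ["{firstname}.{lastname}", "{f}{lastname}", "{lastname}_{fi}", "{la}-{f}"]
print(generate_usernames("scott henderson", rules))
```

With the four sample rules, the name `scott henderson` yields `scott.henderson`, `shenderson`, `henderson_sc` and `he-s`.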

Conversion rules when middle name is detected


{firstname}{middle}{lastname}	{f}{m}{l}	{lastname}{middle}{f}	{f}{m}{l}
{firstname}.{middle}.{lastname}	{f}.{m}.{l}	{lastname}.{middle}.{f}	{f}.{m}.{l}
{firstname}_{middle}_{lastname}	{f}_{m}_{l}	{lastname}_{middle}_{f}	{f}_{m}_{l}
{firstname}-{middle}-{lastname}	{f}-{m}-{l}	{lastname}-{middle}-{f}	{f}-{m}-{l}
{firstname} {middle} {lastname}	{f} {m} {l}	{lastname} {middle} {f}	{f} {m} {l}
{f}{middle}{lastname}	{fi}{middle}{lastname}	{lastname}{middle}{fi}	{firstname}
{f}.{middle}.{lastname}	{fi}.{middle}.{lastname}	{lastname}.{middle}.{fi}	{middle}
{f}_{middle}_{lastname}	{fi}_{middle}_{lastname}	{lastname}_{middle}_{fi}	{lastname}
{f}-{middle}-{lastname}	{fi}-{middle}-{lastname}	{lastname}-{middle}-{fi}
{f} {middle} {lastname}	{fi} {middle} {lastname}	{lastname} {middle} {fi}
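
The middle-name rules can be sketched in the same spirit. How BabelStrike actually detects a middle name is not stated here; this sketch assumes, purely for illustration, that a name with exactly three whitespace-separated parts has one. The function name `expand_middle_rule` is hypothetical.

```python
def expand_middle_rule(rule, full_name):
    """Expand one middle-name rule for a name of the form 'first middle last'."""
    parts = full_name.split()
    if len(parts) != 3:
        # Assumption: only three-part names count as having a middle name.
        raise ValueError("expected exactly three name parts")
    first, middle, last = parts
    tokens = {
        "{firstname}": first,
        "{middle}": middle,
        "{lastname}": last,
        "{fi}": first[:2],
        "{f}": first[:1],
        "{m}": middle[:1],
        "{l}": last[:1],
    }
    for token, value in tokens.items():
        rule = rule.replace(token, value)
    return rule


for rule in ["{f}.{m}.{l}", "{fi}_{middle}_{lastname}", "{lastname}{middle}{f}"]:
    print(expand_middle_rule(rule, "john michael smith"))
```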

Installation & Usage

Install the dependencies with pip:

pip3 install -r requirements.txt

Usage:

babelstrike.py [-h] -f FILE [-r] [-c] [-a] [-d DOMAIN] [-u] [-q]

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to process.
  -r, --romanization    Transliterate names to the latin alphabet.
  -c, --convertion      Perform name-to-username convertions.
  -a, --auto-reverse    Perform name-to-username convertion patterns against the reversed version of each name as well.
  -d DOMAIN, --domain DOMAIN
                        Comma seperated list of domains to add as prefix to each generated username (e.g. EVILCORP\scott.henderson).
  -u, --update          Pull the latest version from the original repo.
  -q, --quiet           Do not print the banner on startup.
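
To make the effect of `-d` concrete, the following sketch shows one way a comma-separated domain list could be turned into prefixed usernames in the `EVILCORP\scott.henderson` form from the help text. The function name `add_domain_prefixes` and the second domain in the example are assumptions for illustration; the tool's own implementation may differ.

```python
def add_domain_prefixes(usernames, domains):
    """Prefix every username with each domain from a comma-separated string."""
    # Ignore stray whitespace and empty entries such as a trailing comma.
    prefixes = [d.strip() for d in domains.split(",") if d.strip()]
    return [f"{domain}\\{user}" for domain in prefixes for user in usernames]


for entry in add_domain_prefixes(["scott.henderson", "shenderson"], "EVILCORP, EVILCORP.LOCAL"):
    print(entry)
```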

Contributions

For the Romanization feature to be accurate, I decided to use custom character substitution maps for each language, preferably made by native speakers. I'm looking for people around the world to create such maps, each of which is basically a Python dictionary.

Instructions

If you want to contribute a language Class all you have to do is:

Find an official Romanization standard for your language's alphabet (e.g. on Wikipedia),
Copy a language Class file from the language_classes folder to use as a template (I suggest you use Greek.py),
Edit the filename and the Class name to represent your language,
Edit the char_substitution_map dictionary to create the character substitution map (Important: don't change the name of the dictionary),
- Map lowercase letters only,
- Take into consideration double or triple letter sounds that may be transliterated as a single character of the Latin alphabet,
- Take into consideration accented characters (e.g. à, è, ì, ò, ù),
- When a letter has more than one transliteration equivalent, use a list to include all of them (BabelStrike will handle all variations). Example:
```
# In Greek, the letter 'υ' may be transliterated as 'y' or 'u'. 
# This is how it should be declared in the character mapping dictionary:

char_substitution_map = {
  'ά' : 'a',
  'έ' : 'e',
  'υ' : ['y','u']
}
```
Save the new class in the language_classes folder under an appropriate filename.
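
To show why the map format matters, here is a minimal sketch of how such a dictionary could be applied to a name. It is not BabelStrike's actual code: the function name `romanize` is hypothetical, and the `'ου'` and `'ο'` entries are added only to illustrate a multi-letter sound. Longer keys are tried first so that double or triple letter sounds take priority, and list values are expanded into every variation.

```python
from itertools import product

# Small illustrative map; a real language class would cover the whole alphabet.
char_substitution_map = {
    'ά': 'a',
    'έ': 'e',
    'ο': 'o',
    'ου': 'ou',          # assumed multi-letter entry, for illustration only
    'υ': ['y', 'u'],
}


def romanize(name, char_map):
    """Return every romanized variation of a name under the given map."""
    name = name.lower()  # the maps hold lowercase letters only
    keys = sorted(char_map, key=len, reverse=True)  # longest keys first
    choices = []
    i = 0
    while i < len(name):
        for key in keys:
            if name.startswith(key, i):
                value = char_map[key]
                choices.append(value if isinstance(value, list) else [value])
                i += len(key)
                break
        else:
            choices.append([name[i]])  # unmapped characters pass through
            i += 1
    return ["".join(combo) for combo in product(*choices)]


print(romanize("υά", char_substitution_map))
```

A letter with two equivalents doubles the number of results, so a name containing several such letters produces one output per combination.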
