Implement company type simplification #20

Open
pudo opened this issue Nov 21, 2024 · 1 comment

pudo commented Nov 21, 2024

Right now, fingerprints can only remove company type information from a company name, or generate a shortened form as part of a heavily simplified string (Siemens Aktiengesellschaft -> ag siemens). I'd like to expand that functionality to:

a. Enable the simplification/rewrite of long company types such that they can still be shown to the user afterwards (Siemens Aktiengesellschaft -> Siemens AG)
b. Use that same mapping database to do the strong normalisation, perhaps including the ability to choose how "generic" to make the rewrite. For example, the Russian company type OOO is sometimes normalised to LLC, a fairly radical simplification that we could keep as "Level 2" and make optional.
c. Have an option to generate simplified company names with stopwords normalised ("Company", "International", etc.)
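
To make the three options concrete, here is a minimal sketch of how they could behave. It is not the fingerprints API: `TYPES`, `STOPWORDS`, `rewrite_type` and `drop_stopwords` are hypothetical names, the table only holds the examples mentioned in this issue, and "normalised" stopwords is interpreted here as simply dropping them.

```python
import re

# Illustrative entries only (hypothetical table, not the real database).
# Key: lower-cased form found in a name.
# Value: (display form, optional broader "Level 2" form).
TYPES = {
    "aktiengesellschaft": ("AG", None),
    "gesellschaft mit beschränkter haftung": ("GmbH", "Ltd"),
    "ooo": ("OOO", "LLC"),
}
STOPWORDS = {"company", "international"}

# Longest alias first, so multi-word types win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(TYPES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def rewrite_type(name: str, level: int = 1) -> str:
    """Rewrite company types in `name`; level 2 uses the broader form if any."""

    def repl(match: re.Match) -> str:
        simple, broader = TYPES[match.group(1).lower()]
        if level >= 2 and broader is not None:
            return broader
        return simple

    return _PATTERN.sub(repl, name)


def drop_stopwords(name: str) -> str:
    """Remove stopword tokens, keeping the original order of the rest."""
    return " ".join(t for t in name.split() if t.lower() not in STOPWORDS)


print(rewrite_type("Siemens Aktiengesellschaft"))  # Siemens AG
print(rewrite_type("OOO Example", level=2))        # LLC Example
```

With `level=1` the output stays suitable for display (option a); `level=2` applies the more generic rewrite only where a broader form is defined (option b).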

pudo commented Nov 21, 2024

Here's a brainstorm of what a metadata file enabling some of this could look like. It would be minimally normalised, for use in display rewrites, and we could also generate the current contents of types.yml from it:

person_name_prefixes:
  - "Mr"
  - "Ms"
  - "Mrs"
  - "Mister"
  - "Miss"
  - "Madam"
  - "Madame"
  - "Monsieur"
  - "Honorable"
  - "Honourable"
  - "Mme"
  - "Mmme"
  - "Herr"
  - "Hr"
  - "Frau"
  - "Fr"
  - "The"
  - "Fräulein"
  - "Senor"
  - "Senorita"
  - "Sheik"
  - "Sheikh"
  - "Shaikh"
  - "Sr"
  - "Sir"
  - "Lady"
basic_stopwords:
  - "de"
  - "of"
  - "and"
  - "&"
company_stopwords:
  - Company
  - Business
  - Management
  - International
  - Intl
  - Corporation
  - Corp
  - Fund
  - Holding
  - Holdings
  - Trading
  - Import
  - Export
  - Trust
  - Services
  - Industries
  - Consulting
  - Partner
  - Partners
  - Solutions
  - Group
  - Foundation
  # - Fdn
  - Commercial
company_stopwords_broad:
  - Development
  - Financial
  - Investment
  - Investments
company_types:
  - simple: GmbH
    broader: Ltd
    alias:
      - Gesellschaft mit beschränkter Haftung
  - simple: GmbH & Co. KG
    broader: GmbH
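
As a rough sketch of how the `company_types` section could be consumed, the snippet below works on the structure as it would look once parsed (for example by a YAML loader). The helper names `display_mapping` and `broaden` are hypothetical, and treating `broader` as a chain that can be followed several steps is an assumption, not something the brainstorm specifies.

```python
# The company_types section above, written as the already-parsed structure.
COMPANY_TYPES = [
    {
        "simple": "GmbH",
        "broader": "Ltd",
        "alias": ["Gesellschaft mit beschränkter Haftung"],
    },
    {"simple": "GmbH & Co. KG", "broader": "GmbH"},
]


def display_mapping(types):
    """Map every alias, and the simple form itself, to its display form."""
    mapping = {}
    for entry in types:
        simple = entry["simple"]
        mapping[simple.casefold()] = simple
        for alias in entry.get("alias", []):
            mapping[alias.casefold()] = simple
    return mapping


def broaden(simple, types, levels=1):
    """Follow `broader` links from `simple` for at most `levels` steps."""
    parents = {e["simple"]: e.get("broader") for e in types}
    current = simple
    for _ in range(levels):
        parent = parents.get(current)
        if parent is None:
            break  # no broader form defined, stop here
        current = parent
    return current


mapping = display_mapping(COMPANY_TYPES)
print(mapping["gesellschaft mit beschränkter haftung"])  # GmbH
print(broaden("GmbH & Co. KG", COMPANY_TYPES, levels=1))  # GmbH
print(broaden("GmbH & Co. KG", COMPANY_TYPES, levels=2))  # Ltd
```

A flat alias table like the one `display_mapping` returns is also the kind of structure from which a types.yml-style file could be generated.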
