Drop column with no information #736

George3d6 · 2021-11-08T18:18:16Z

Dropping columns that don't have any information (a single unique value)

There is currently no test for this, since get_identifier_description has no tests at all (maybe a todo for later), and the check is so dumb that it's not the first thing I'd want to test; see mindsdb/type_infer#31

paxcema

As discussed, it may be worth moving these modules to a separate identification module, but as far as dropping uninformative columns goes, this seems OK to me.

hakunanatasha · 2021-11-10T16:33:20Z

lightwood/helpers/text.py

@@ -210,7 +210,12 @@ def get_identifier_description_mp(arg_tup):

 def get_identifier_description(data, column_name, data_dtype):


type hint

data: Iterable, column_name: str, data_dtype: str

hakunanatasha · 2021-11-10T16:33:31Z

lightwood/helpers/text.py

@@ -210,7 +210,12 @@ def get_identifier_description_mp(arg_tup):

 def get_identifier_description(data, column_name, data_dtype):
    data = list(data)


why do we make this a list?

if it's already in a series form you can simply do something like len(data.unique())
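A minimal sketch of the reviewer's suggestion, assuming `data` may arrive as a pandas Series. The helper name `count_unique` is hypothetical and is not part of lightwood; the plain-iterable fallback also assumes the values are hashable.

```python
import pandas as pd


def count_unique(data) -> int:
    """Count distinct values without first copying the column into a list.

    Hypothetical helper for illustration only.
    """
    if isinstance(data, pd.Series):
        # dropna=False counts missing values as one distinct value,
        # which matches len(data.unique()).
        return int(data.nunique(dropna=False))
    # Fallback for generic iterables (assumes hashable values).
    return len(set(data))


constant_column = pd.Series(["US", "US", "US"])
print(count_unique(constant_column))
```

With a Series input this avoids the intermediate `list(data)` copy; whether the callers always pass a Series is not something the diff confirms.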

hakunanatasha · 2021-11-10T16:33:59Z

lightwood/helpers/text.py

+    if nr_unique == 1:
+        return 'No Information'
+
+    unquie_pct = nr_unique / len(data)


fix the spelling

hakunanatasha · 2021-11-10T16:34:17Z

lightwood/helpers/text.py

+    if nr_unique == 1:
+        return 'No Information'
+
+    unquie_pct = nr_unique / len(data)

    spaces = [len(str(x).split(' ')) - 1 for x in data]
    mean_spaces = np.mean(spaces)


why are we doing this?

George3d6 · 2021-11-11

This marks the column for dropping under the "identifier / foreign-keys" rule: if this function returns a string (rather than None), the column is dropped. Maybe the interface is confusing, and maybe we should broaden the definition of "identifier" to something like "no_information"?
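The contract described here (a string means "drop this column", None means "keep it") can be sketched as follows. This is an illustration under assumptions, not lightwood's actual code: `describe_uninformative` is a hypothetical stand-in for `get_identifier_description` that implements only the single-unique-value check, and `columns_to_drop` is a hypothetical caller.

```python
from typing import Dict, Iterable, Optional


def describe_uninformative(data: Iterable, column_name: str,
                           data_dtype: str) -> Optional[str]:
    """Return a reason string if the column should be dropped, else None.

    Hypothetical stand-in: only the single-unique-value rule is modelled.
    """
    values = list(data)
    nr_unique = len(set(values))  # assumes hashable values
    if nr_unique == 1:
        return 'No Information'
    return None


def columns_to_drop(columns: Dict[str, list],
                    dtypes: Dict[str, str]) -> Dict[str, str]:
    """Collect every column whose description is a string (hypothetical caller)."""
    reasons = {}
    for name, values in columns.items():
        reason = describe_uninformative(values, name, dtypes[name])
        if reason is not None:
            reasons[name] = reason
    return reasons


columns = {'country': ['US', 'US', 'US'], 'age': [31, 45, 27]}
dtypes = {'country': 'categorical', 'age': 'integer'}
print(columns_to_drop(columns, dtypes))
```

Returning the reason rather than a bare boolean keeps the "why" available to the caller, which is also why renaming the concept from "identifier" to something broader would fit this interface.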

fix: dropping columns with no info

64fec8c

George3d6 added the enhancement New feature or request label Nov 8, 2021

paxcema approved these changes Nov 8, 2021

hakunanatasha suggested changes Nov 10, 2021

George3d6 added 2 commits November 10, 2021 21:36

fix: spelling

fda3556

feat: type hinting

93eb330

George3d6 merged commit cc3b08c into staging Nov 11, 2021

paxcema mentioned this pull request Nov 17, 2021

Release 1.7.0 #752

Merged

hamishfagg deleted the drop_no_inf_cols branch December 10, 2024 03:59
