how to work with duplicated field names? #225

aborruso · 2019-02-03T11:12:42Z

Hi,
I have this kind of input file (\t as field separator)

JA	Yea	FE	Yea	MA	Yea	AP	Yea	MA	Yea	JU	Yea	JU	Yea	AU	Yea	SEP	Yea	OC	Yea	NO	Yea	DE	Yea	WI	Yea	SP	Yea	SU	Yea	AU	Yea	AN	Year
3.6	1916	3.8	1998	4.7	1957	5.7	2011	7.6	2008	10.6	2017	12.5	2006	12.7	1997	11.3	2006	9.4	2001	6.4	1994	5.2	2015	3.03	1989	5.17	2014	11.34	2003	7.97	2011	6.30	2014
3.2	2007	3.1	1990	4.2	1938	5.3	2014	7.5	2017	10.1	2016	12.1	2003	12.2	2004	11.0	2016	8.8	2005	5.8	2011	4.8	1934	2.79	2007	5.00	2017	11.14	2006	7.95	2006	6.11	2006
3.1	1989	3.0	1961	3.8	1990	5.3	2007	7.5	1999	10.1	2003	12.1	1983	11.9	1975	10.5	1949	8.5	2006	5.3	2015	4.2	1988	2.78	1935	4.96	1999	10.99	2018	7.47	2014	6.03	2017
3.1	1921	3.0	1945	3.7	2017	5.0	1943	7.3	2014	10.0	1976	12.0	2018	11.8	2008	10.3	1999	8.4	1969	5.1	1997	4.1	1974	2.63	2016	4.80	2011	10.92	2016	7.31	2001	5.99	2002
2.8	1990	2.9	1926	3.6	1997	4.8	2018	7.3	1964	9.9	2014	11.9	2013	11.8	1995	10.2	1958	8.4	1968	5.1	1938	3.7	1971	2.63	1998	4.78	2007	10.84	2004	7.06	2009	5.98	2011
2.7	1983	2.9	1918	3.5	1998	4.8	2004	7.2	1998	9.9	2007	11.9	1991	11.7	2003	10.1	2011	8.3	2013	5.0	2002	3.7	1953	2.57	1975	4.72	1992	10.82	1997	7.03	1978	5.97	2004
2.6	1932	2.6	2017	3.4	2012	4.8	1961	7.2	1952	9.9	2004	11.8	1995	11.7	1990	9.9	2005	8.2	2017	4.9	1953	3.7	1924	2.48	2014	4.62	1998	10.73	2017	6.98	1949	5.95	2007
2.5	2008	2.6	2011	3.4	1981	4.7	1993	7.1	1992	9.8	2018	11.8	1933	11.6	2002	9.9	2000	8.2	1995	4.8	2014	3.3	2018	2.34	1925	4.55	1952	10.70	1995	6.97	2005	5.88	2005
2.5	2005	2.5	2014	3.4	1961	4.6	2009	7.0	1970	9.8	2005	11.6	2010	11.6	1955	9.9	1998	8.0	2011	4.6	2009	3.2	1942	2.32	1923	4.54	1961	10.70	1933	6.96	1995	5.85	1999

As you can see, the same field name, Yea, appears several times.

If I simply cat it, Miller outputs only one Yea field:

$ mlr --tsv cat input

JA  Yea  FE  MA  AP  JU   AU   SEP  OC  NO  DE  WI   SP   SU    AN   Year
3.6 2011 3.8 7.6 5.7 12.5 7.97 11.3 9.4 6.4 5.2 3.03 5.17 11.34 6.30 2014
3.2 2006 3.1 7.5 5.3 12.1 7.95 11.0 8.8 5.8 4.8 2.79 5.00 11.14 6.11 2006
3.1 2014 3.0 7.5 5.3 12.1 7.47 10.5 8.5 5.3 4.2 2.78 4.96 10.99 6.03 2017
3.1 2001 3.0 7.3 5.0 12.0 7.31 10.3 8.4 5.1 4.1 2.63 4.80 10.92 5.99 2002
2.8 2009 2.9 7.3 4.8 11.9 7.06 10.2 8.4 5.1 3.7 2.63 4.78 10.84 5.98 2011
2.7 1978 2.9 7.2 4.8 11.9 7.03 10.1 8.3 5.0 3.7 2.57 4.72 10.82 5.97 2004
2.6 1949 2.6 7.2 4.8 11.8 6.98 9.9  8.2 4.9 3.7 2.48 4.62 10.73 5.95 2007
2.5 2005 2.6 7.1 4.7 11.8 6.97 9.9  8.2 4.8 3.3 2.34 4.55 10.70 5.88 2005
2.5 1995 2.5 7.0 4.6 11.6 6.96 9.9  8.0 4.6 3.2 2.32 4.54 10.70 5.85 1999

Is there a way to handle this kind of weird input in Miller?

Thank you


aborruso · 2019-02-03T12:15:03Z

A DSL solution

mlr --inidx --ifs '\t' --ocsv put -S 'counter=1; for (k,v in $*) {if (v =~ "Ye"){$[k]=v.counter;counter += 1;}}' then cat  input | tail -n +2 | mlr --c2p cat

gives me

JA  Yea1 FE  Yea2 MA  Yea3 AP  Yea4 Yea5 JU   Yea6 Yea7 AU   Yea8 SEP  Yea9 OC  Yea10 NO  Yea11 DE  Yea12 WI   Yea13 SP   Yea14 SU    Yea15 Yea16 AN   Year17
3.6 1916 3.8 1998 7.6 1957 5.7 2011 2008 12.5 2017 2006 7.97 1997 11.3 2006 9.4 2001  6.4 1994  5.2 2015  3.03 1989  5.17 2014  11.34 2003  2011  6.30 2014
3.2 2007 3.1 1990 7.5 1938 5.3 2014 2017 12.1 2016 2003 7.95 2004 11.0 2016 8.8 2005  5.8 2011  4.8 1934  2.79 2007  5.00 2017  11.14 2006  2006  6.11 2006
3.1 1989 3.0 1961 7.5 1990 5.3 2007 1999 12.1 2003 1983 7.47 1975 10.5 1949 8.5 2006  5.3 2015  4.2 1988  2.78 1935  4.96 1999  10.99 2018  2014  6.03 2017
3.1 1921 3.0 1945 7.3 2017 5.0 1943 2014 12.0 1976 2018 7.31 2008 10.3 1999 8.4 1969  5.1 1997  4.1 1974  2.63 2016  4.80 2011  10.92 2016  2001  5.99 2002
2.8 1990 2.9 1926 7.3 1997 4.8 2018 1964 11.9 2014 2013 7.06 1995 10.2 1958 8.4 1968  5.1 1938  3.7 1971  2.63 1998  4.78 2007  10.84 2004  2009  5.98 2011
2.7 1983 2.9 1918 7.2 1998 4.8 2004 1998 11.9 2007 1991 7.03 2003 10.1 2011 8.3 2013  5.0 2002  3.7 1953  2.57 1975  4.72 1992  10.82 1997  1978  5.97 2004
2.6 1932 2.6 2017 7.2 2012 4.8 1961 1952 11.8 2004 1995 6.98 1990 9.9  2005 8.2 2017  4.9 1953  3.7 1924  2.48 2014  4.62 1998  10.73 2017  1949  5.95 2007
2.5 2008 2.6 2011 7.1 1981 4.7 1993 1992 11.8 2018 1933 6.97 2002 9.9  2000 8.2 1995  4.8 2014  3.3 2018  2.34 1925  4.55 1952  10.70 1995  2005  5.88 2005
2.5 2005 2.5 2014 7.0 1961 4.6 2009 1970 11.6 2005 2010 6.96 1955 9.9  1998 8.0 2011  4.6 2009  3.2 1942  2.32 1923  4.54 1961  10.70 1933  1995  5.85 1999

johnkerl · 2019-02-03T17:35:18Z

So a Miller record is (by central design) a mapping from name to value, rather than from integer position to value as in most tools in the Unix toolkit such as sort, cut, awk, etc.

So given input Yea=1,Yea=2 on the same input line, first Yea=1 is stored, then it is updated with Yea=2. This happens in the input parser, so the value Yea=1 is unavailable to any further processing.
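
To make the overwrite behaviour concrete, here is a minimal sketch in plain awk. It is not Miller's parser, only an illustration of the same last-value-wins rule; `parse_record` is a hypothetical helper name, and the sketch assumes a single input line of comma-separated `key=value` pairs.

```shell
# Hypothetical helper (not part of Miller): parse one "k=v,k=v" line into a
# name-to-value map, keeping first-seen key order; a repeated key overwrites.
parse_record() {
  printf '%s\n' "$1" | awk -F',' '
    {
      n = 0
      for (i = 1; i <= NF; i++) {
        eq  = index($i, "=")                    # split each pair at the first "="
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        if (!(key in value)) order[++n] = key   # remember the first position only
        value[key] = val                        # a later duplicate replaces the value
      }
      out = ""
      for (j = 1; j <= n; j++)
        out = out (j > 1 ? "," : "") order[j] "=" value[order[j]]
      print out
    }'
}

parse_record 'a=1,b=2,b=3,c=4'   # prints: a=1,b=3,c=4
```

The key `b` keeps its original position but ends up with the second value, which is why the earlier value cannot be recovered downstream.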

The nidx trick is a nice one. I'll put your solution in the FAQ (and credit you of course). :)

aborruso · 2019-02-03T17:53:02Z

Hi @johnkerl thank you.

I have used the nidx trick to read a very weird (for me) input file, the one I asked about in the fixed-width issue.

And I have written this long sh script https://gist.github.com/aborruso/41e825a0bd30649c9341b04347304b39#file-00_uk-tmin-sh

johnkerl · 2019-02-03T18:05:29Z

7870562 adds a mlr cat -v option which does an internal record-dump to stderr:

$ echo 'a=1,b=2,b=3,c=4' | mlr cat -v 2> err.txt
a=1,b=3,c=4

$ cat err.txt
field_count = 3
| phead:   0x7fdb91c04320 | ptail   0x7fdb91c04380
| prev:              0x0 curr:   0x7fdb91c04320 next:   0x7fdb91c04350 | key:            a | value:            1 |
| prev:   0x7fdb91c04320 curr:   0x7fdb91c04350 next:   0x7fdb91c04380 | key:            b | value:            3 |
| prev:   0x7fdb91c04350 curr:   0x7fdb91c04380 next:              0x0 | key:            c | value:            4 |
NULL

aborruso · 2019-02-03T18:39:37Z

@johnkerl so, to check whether there is a problem, I can compare the field_count output with the number of fields in the input file?

Thank you

johnkerl · 2019-02-24T18:15:53Z

@aborruso that sounds good.
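
A rough version of that check can also be done before the file ever reaches Miller. The sketch below is an assumption-laden illustration, not a Miller feature: `header_counts` is a hypothetical helper, it assumes a TSV file whose first line is the header, and it uses the number of distinct header names as a stand-in for the field count of the parsed record.

```shell
# Hypothetical helper: print "<total fields> <distinct names>" for the header
# (first line) of a TSV file; the two numbers differ when names repeat.
header_counts() {
  awk -F'\t' 'NR == 1 {
    distinct = 0
    for (i = 1; i <= NF; i++)
      if (!($i in seen)) { seen[$i] = 1; distinct++ }
    print NF, distinct
    exit
  }' "$1"
}

# Small demo file using the first six columns of the input from this issue.
tmp=$(mktemp)
printf 'JA\tYea\tFE\tYea\tMA\tYea\n3.6\t1916\t3.8\t1998\t4.7\t1957\n' > "$tmp"
header_counts "$tmp"    # prints: 6 4
rm -f "$tmp"
```

If the two numbers are equal, no header name is repeated; if the second is smaller, some values would be overwritten on input.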

Another idea here is that I can add a command-line flag for the record-ingestor to add suffixes such as _1, _2 to field names when they appear duplicated within a given record.
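
As an illustration of the suffixing idea, here is a minimal awk sketch that preprocesses the header outside Miller. It is not the flag described above: `dedupe_header` is a hypothetical helper, and the naming scheme (first occurrence unchanged, later ones get `_2`, `_3`, ...) is just one possible choice assumed for this example.

```shell
# Hypothetical helper: rewrite only the header line of a TSV stream so that
# repeated names get _2, _3, ... appended; data lines pass through unchanged.
dedupe_header() {
  awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        count[$i]++                               # occurrences of this name so far
        if (count[$i] > 1) $i = $i "_" count[$i]  # suffix the 2nd, 3rd, ... occurrence
      }
    }
    { print }'
}

# Header becomes: JA Yea FE Yea_2 MA Yea_3 (tab-separated)
printf 'JA\tYea\tFE\tYea\tMA\tYea\n3.6\t1916\t3.8\t1998\t4.7\t1957\n' | dedupe_header
```

The sketch does not guard against a generated name such as `Yea_2` colliding with a column that already has that name; a real implementation would need to handle that case.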

aborruso · 2019-02-24T18:51:07Z

@johnkerl it's great for me!

johnkerl · 2021-12-23T01:54:52Z

@aborruso #794

aborruso · 2021-12-23T07:09:19Z

Great @johnkerl !!

johnkerl added the active label Feb 3, 2019

johnkerl added on deck and removed active labels Jun 2, 2019

aborruso mentioned this issue May 27, 2021

Document mlr join --no-implicit-csv-header override #524

Closed

johnkerl mentioned this issue Aug 26, 2021

[feature request] Input (and maybe) output format handling multi-valued keys #635

Open

johnkerl mentioned this issue Dec 20, 2021

How to reference field names by index #785

Closed

johnkerl added active and removed on deck labels Dec 20, 2021

johnkerl mentioned this issue Dec 23, 2021

Dedupe field names by default #794

Merged

aborruso closed this as completed Dec 23, 2021

johnkerl mentioned this issue Jul 6, 2022

A wrong CSV mapped as valid? #1050

Closed

johnkerl removed the active label Aug 31, 2023
