how to work with duplicated field names? #225

Closed
aborruso opened this issue Feb 3, 2019 · 9 comments

@aborruso
Contributor

aborruso commented Feb 3, 2019

Hi,
I have an input file like this (with \t as the field separator):

JA	Yea	FE	Yea	MA	Yea	AP	Yea	MA	Yea	JU	Yea	JU	Yea	AU	Yea	SEP	Yea	OC	Yea	NO	Yea	DE	Yea	WI	Yea	SP	Yea	SU	Yea	AU	Yea	AN	Year
3.6	1916	3.8	1998	4.7	1957	5.7	2011	7.6	2008	10.6	2017	12.5	2006	12.7	1997	11.3	2006	9.4	2001	6.4	1994	5.2	2015	3.03	1989	5.17	2014	11.34	2003	7.97	2011	6.30	2014
3.2	2007	3.1	1990	4.2	1938	5.3	2014	7.5	2017	10.1	2016	12.1	2003	12.2	2004	11.0	2016	8.8	2005	5.8	2011	4.8	1934	2.79	2007	5.00	2017	11.14	2006	7.95	2006	6.11	2006
3.1	1989	3.0	1961	3.8	1990	5.3	2007	7.5	1999	10.1	2003	12.1	1983	11.9	1975	10.5	1949	8.5	2006	5.3	2015	4.2	1988	2.78	1935	4.96	1999	10.99	2018	7.47	2014	6.03	2017
3.1	1921	3.0	1945	3.7	2017	5.0	1943	7.3	2014	10.0	1976	12.0	2018	11.8	2008	10.3	1999	8.4	1969	5.1	1997	4.1	1974	2.63	2016	4.80	2011	10.92	2016	7.31	2001	5.99	2002
2.8	1990	2.9	1926	3.6	1997	4.8	2018	7.3	1964	9.9	2014	11.9	2013	11.8	1995	10.2	1958	8.4	1968	5.1	1938	3.7	1971	2.63	1998	4.78	2007	10.84	2004	7.06	2009	5.98	2011
2.7	1983	2.9	1918	3.5	1998	4.8	2004	7.2	1998	9.9	2007	11.9	1991	11.7	2003	10.1	2011	8.3	2013	5.0	2002	3.7	1953	2.57	1975	4.72	1992	10.82	1997	7.03	1978	5.97	2004
2.6	1932	2.6	2017	3.4	2012	4.8	1961	7.2	1952	9.9	2004	11.8	1995	11.7	1990	9.9	2005	8.2	2017	4.9	1953	3.7	1924	2.48	2014	4.62	1998	10.73	2017	6.98	1949	5.95	2007
2.5	2008	2.6	2011	3.4	1981	4.7	1993	7.1	1992	9.8	2018	11.8	1933	11.6	2002	9.9	2000	8.2	1995	4.8	2014	3.3	2018	2.34	1925	4.55	1952	10.70	1995	6.97	2005	5.88	2005
2.5	2005	2.5	2014	3.4	1961	4.6	2009	7.0	1970	9.8	2005	11.6	2010	11.6	1955	9.9	1998	8.0	2011	4.6	2009	3.2	1942	2.32	1923	4.54	1961	10.70	1933	6.96	1995	5.85	1999

As you can see, the same field name, Yea, appears several times.

If I simply cat it, Miller outputs only one Yea field:

$ mlr --tsv cat input

JA  Yea  FE  MA  AP  JU   AU   SEP  OC  NO  DE  WI   SP   SU    AN   Year
3.6 2011 3.8 7.6 5.7 12.5 7.97 11.3 9.4 6.4 5.2 3.03 5.17 11.34 6.30 2014
3.2 2006 3.1 7.5 5.3 12.1 7.95 11.0 8.8 5.8 4.8 2.79 5.00 11.14 6.11 2006
3.1 2014 3.0 7.5 5.3 12.1 7.47 10.5 8.5 5.3 4.2 2.78 4.96 10.99 6.03 2017
3.1 2001 3.0 7.3 5.0 12.0 7.31 10.3 8.4 5.1 4.1 2.63 4.80 10.92 5.99 2002
2.8 2009 2.9 7.3 4.8 11.9 7.06 10.2 8.4 5.1 3.7 2.63 4.78 10.84 5.98 2011
2.7 1978 2.9 7.2 4.8 11.9 7.03 10.1 8.3 5.0 3.7 2.57 4.72 10.82 5.97 2004
2.6 1949 2.6 7.2 4.8 11.8 6.98 9.9  8.2 4.9 3.7 2.48 4.62 10.73 5.95 2007
2.5 2005 2.6 7.1 4.7 11.8 6.97 9.9  8.2 4.8 3.3 2.34 4.55 10.70 5.88 2005
2.5 1995 2.5 7.0 4.6 11.6 6.96 9.9  8.0 4.6 3.2 2.32 4.54 10.70 5.85 1999

Is there a way to handle this kind of weird input in Miller?

Thank you

@aborruso
Contributor Author

aborruso commented Feb 3, 2019

A DSL solution

mlr --inidx --ifs '\t' --ocsv put -S 'counter=1; for (k,v in $*) {if (v =~ "Ye"){$[k]=v.counter;counter += 1;}}' then cat  input | tail -n +2 | mlr --c2p cat

gives me

JA  Yea1 FE  Yea2 MA  Yea3 AP  Yea4 Yea5 JU   Yea6 Yea7 AU   Yea8 SEP  Yea9 OC  Yea10 NO  Yea11 DE  Yea12 WI   Yea13 SP   Yea14 SU    Yea15 Yea16 AN   Year17
3.6 1916 3.8 1998 7.6 1957 5.7 2011 2008 12.5 2017 2006 7.97 1997 11.3 2006 9.4 2001  6.4 1994  5.2 2015  3.03 1989  5.17 2014  11.34 2003  2011  6.30 2014
3.2 2007 3.1 1990 7.5 1938 5.3 2014 2017 12.1 2016 2003 7.95 2004 11.0 2016 8.8 2005  5.8 2011  4.8 1934  2.79 2007  5.00 2017  11.14 2006  2006  6.11 2006
3.1 1989 3.0 1961 7.5 1990 5.3 2007 1999 12.1 2003 1983 7.47 1975 10.5 1949 8.5 2006  5.3 2015  4.2 1988  2.78 1935  4.96 1999  10.99 2018  2014  6.03 2017
3.1 1921 3.0 1945 7.3 2017 5.0 1943 2014 12.0 1976 2018 7.31 2008 10.3 1999 8.4 1969  5.1 1997  4.1 1974  2.63 2016  4.80 2011  10.92 2016  2001  5.99 2002
2.8 1990 2.9 1926 7.3 1997 4.8 2018 1964 11.9 2014 2013 7.06 1995 10.2 1958 8.4 1968  5.1 1938  3.7 1971  2.63 1998  4.78 2007  10.84 2004  2009  5.98 2011
2.7 1983 2.9 1918 7.2 1998 4.8 2004 1998 11.9 2007 1991 7.03 2003 10.1 2011 8.3 2013  5.0 2002  3.7 1953  2.57 1975  4.72 1992  10.82 1997  1978  5.97 2004
2.6 1932 2.6 2017 7.2 2012 4.8 1961 1952 11.8 2004 1995 6.98 1990 9.9  2005 8.2 2017  4.9 1953  3.7 1924  2.48 2014  4.62 1998  10.73 2017  1949  5.95 2007
2.5 2008 2.6 2011 7.1 1981 4.7 1993 1992 11.8 2018 1933 6.97 2002 9.9  2000 8.2 1995  4.8 2014  3.3 2018  2.34 1925  4.55 1952  10.70 1995  2005  5.88 2005
2.5 2005 2.5 2014 7.0 1961 4.6 2009 1970 11.6 2005 2010 6.96 1955 9.9  1998 8.0 2011  4.6 2009  3.2 1942  2.32 1923  4.54 1961  10.70 1933  1995  5.85 1999

@johnkerl
Owner

johnkerl commented Feb 3, 2019

Miller treats each record (by central design) as a mapping from field name to value, rather than from integer position to value as in most tools in the Unix toolkit such as sort, cut, awk, etc.

So given input Yea=1,Yea=2 on the same input line, first Yea=1 is stored, then it is overwritten with Yea=2. This happens in the input parser, so the value Yea=1 is unavailable to any further processing.
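
The overwrite behaviour described above can be imitated outside Miller. The following is only a sketch in sh and awk, not Miller's actual parser: `last_wins` is a hypothetical helper name, and it assumes comma-separated key=value input on each line.

```shell
# Hypothetical helper (not part of Miller): imitate a name-to-value record
# in which a repeated key overwrites the earlier value but keeps the
# position where the key was first seen.
last_wins() {
  awk -F, '{
    n = 0
    split("", val)      # portable way to empty an awk array
    split("", order)
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")
      k = substr($i, 1, eq - 1)
      v = substr($i, eq + 1)
      if (!(k in val)) order[++n] = k   # remember first-seen position
      val[k] = v                        # later duplicates overwrite
    }
    line = ""
    for (j = 1; j <= n; j++)
      line = line (j > 1 ? "," : "") order[j] "=" val[order[j]]
    print line
  }'
}

echo 'a=1,b=2,b=3,c=4' | last_wins
```

Once the second b has been stored, nothing downstream can recover b=2, which is why the renaming has to happen before the names are parsed.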

The nidx trick is a nice one. I'll put your solution in the FAQ (and credit you of course). :)

@johnkerl johnkerl added the active label Feb 3, 2019
@aborruso
Contributor Author

aborruso commented Feb 3, 2019

Hi @johnkerl thank you.

I used the nidx trick to read a very weird (for me) input file, the one I asked about in the fixed-width issue.

I have also written this long sh script: https://gist.github.com/aborruso/41e825a0bd30649c9341b04347304b39#file-00_uk-tmin-sh

@johnkerl
Owner

johnkerl commented Feb 3, 2019

7870562 adds an mlr cat -v option, which dumps the internal record to stderr:

$ echo 'a=1,b=2,b=3,c=4' | mlr cat -v 2> err.txt
a=1,b=3,c=4

$ cat err.txt
field_count = 3
| phead:   0x7fdb91c04320 | ptail   0x7fdb91c04380
| prev:              0x0 curr:   0x7fdb91c04320 next:   0x7fdb91c04350 | key:            a | value:            1 |
| prev:   0x7fdb91c04320 curr:   0x7fdb91c04350 next:   0x7fdb91c04380 | key:            b | value:            3 |
| prev:   0x7fdb91c04350 curr:   0x7fdb91c04380 next:              0x0 | key:            c | value:            4 |
NULL

@aborruso
Contributor Author

aborruso commented Feb 3, 2019

@johnkerl so, to check whether there is a problem, I can compare the field_count output with the number of fields in the input file?

Thank you
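
A similar check can be made on the header line alone, without Miller. This is a minimal sketch, assuming a tab-separated file whose first line is the header; `header_counts` is a hypothetical helper name, not a Miller feature.

```shell
# Hypothetical check: report how many header fields a TSV file has and
# how many of those names are distinct. A mismatch means duplicated names.
header_counts() {
  head -n 1 "$1" | awk -F'\t' '{
    for (i = 1; i <= NF; i++)
      if (!seen[$i]++) distinct++
    printf "fields=%d distinct=%d\n", NF, distinct
  }'
}
```

Calling `header_counts input` on the file from the first comment should report 34 fields but only 16 distinct names, which matches the 16 columns Miller printed.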

@johnkerl
Owner

@aborruso that sounds good.

Another idea here is that I can add a command-line flag for the record-ingestor to add suffixes such as _1, _2 to field names when they appear duplicated within a given record.
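
Until such a flag exists, the same idea can be approximated as a pre-processing step. The sketch below is an assumption-laden illustration, not the proposed Miller flag: `dedupe_header` is a hypothetical name, the input is assumed to be TSV with a header on the first line, and the numbering scheme (first occurrence unchanged, later ones suffixed _2, _3, ...) is a choice made here for simplicity.

```shell
# Hypothetical pre-processing step: make repeated header names unique
# in a TSV stream before handing it to a name-keyed tool.
dedupe_header() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        n = ++count[$i]             # occurrences of this name so far
        if (n > 1) $i = $i "_" n    # second and later copies get a suffix
      }
    }
    { print }                       # data lines pass through unchanged
  ' "$@"
}
```

It could then be used as `dedupe_header input | mlr --tsv cat`. Note that this sketch does not guard against a header that already contains a name such as Yea_2.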

@aborruso
Contributor Author

@johnkerl that would be great for me!

@johnkerl
Owner

@aborruso #794

@aborruso
Contributor Author

Great @johnkerl !!
