Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mbox takes any sentence starting by ^From as a new message #20

Open
gpoo opened this issue Mar 17, 2016 · 8 comments
Open

mbox takes any sentence starting by ^From as a new message #20

gpoo opened this issue Mar 17, 2016 · 8 comments

Comments

@gpoo
Copy link

gpoo commented Mar 17, 2016

The mbox backend parses anything starting with From as a new message. Therefore, the following message from OpenStack will be taken as 2 messages:

From eric at cloudscaling.com  Thu Aug  9 18:39:17 2012
From: eric at cloudscaling.com (Eric Windisch)
Date: Thu, 9 Aug 2012 14:39:17 -0400
Subject: [openstack-dev] [Openstack] Making the RPC backend a required
 configuration parameter
In-Reply-To: <[email protected]>
References: <[email protected]>
 <[email protected]>
Message-ID: <[email protected]>


>  
> I also don't understand why having a default that doesn't work for
> anyone makes any sense.
>  
I would hope that a localhost only installation with a username and password of 'guest' include a very small number of anyones. Who is really using a completely stock, default configuration successfully, and do they really care?  Everyone else is using configuration management of sorts, at which point this discussion is moot. Even devstack changes this configuration.

If someone really is installing Nova from scratch and using a default configuration? Does setuptools also install RabbitMQ for you?  No?  Right, you need to read documentation and recognize that RabbitMQ needs to be installed.  Sure, once it is installed, no configuration is required; Unless you're /actually/ going to use it, of course.

From everything I've seen, the general recommendation on the mailing list for those installing Nova on a single node is to use devstack. In that case, the configuration is prompt-driven, and whatever changes need to be made, can be made.

Regards,
Eric Windisch

The main issue is that leaves the message truncated. In this example, the last paragraph and signature will be lost.

If the purpose is to only parse metadata, the approach is ok. Although there would not be reason to store the body of the message.

@sduenas sduenas added the bug label Mar 17, 2016
@sduenas
Copy link
Member

sduenas commented Mar 17, 2016

Yes, this is a known bug (see db40e3a). This is an error that comes from the current implementation of mbox in the Python's standard library.

https://docs.python.org/3.5/library/mailbox.html#mbox
https://github.com/python/cpython/blob/master/Lib/mailbox.py#L859

In mlstats we use another class to parse mbox files but it was removed in Python 3. I thought to copy its implementation to perceval following what you said here but it has a lot of dependencies with other modules, so I choose the fastest solution.

Any idea is welcome.

@gpoo
Copy link
Author

gpoo commented Mar 17, 2016

The same issue that in 2.x series for mbox. I thought you had solved it here :-)

I think the actual issue is not checking if the next line is a header. That would reduce the odds of having issues. Although, still it may fail in a discussion about email, if somebody pastes header + body as part of the text ;-)

@jgbarah
Copy link
Contributor

jgbarah commented Mar 17, 2016

What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.

I guess in the following code in perceval/backends/mbox.py

                 for message in self.parse_mbox(tmp_path):
                    if not self._validate_message(message):
                        continue
                     yield message

it would be a matter of, instead of continuing, adding to the body of the previous message.

Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex...

If you want I can have it a try to see if I can produce a PR for this...

@sduenas
Copy link
Member

sduenas commented Mar 18, 2016

What about inspecting messages right after being identified as such by the mailbox library, looking for those mandatory headers? If those headers are not found, the message could be safely added to the body of the "previous" message, since it was not really a message.

The problem is there are messages that do not include some mandatory headers, such as Message-ID, but they are messages. This means you'll be joining two messages in one.

Granted, this would force to defer the yield until the next message is analyzed, which could make the code a bit more complex...
If you want I can have it a try to see if I can produce a PR for this...

That's my concern but if you find a clean way to do it, go ahead. Take into account that this is a problem related to the mbox format itself and for messages that are not multipart, so it would make sense to add this behaviour to the parser instead to the code you pasted.

@sduenas sduenas added the mbox label Mar 18, 2016
@gpoo
Copy link
Author

gpoo commented Mar 18, 2016

However, if the next line is not a header (^\w: something), we know that it is not a new message. If the next line looks like a header and it is not, well, you would discard it anyway with the current behavior.

It will fail in other cases, but not in the ones that people start a sentence with From.

And yes, the implementation of mbox in Python is naive, slow and bogus. It is unfortunate there is not a good mbox parser in Python like in other languages.

@gpoo
Copy link
Author

gpoo commented Mar 18, 2016

I wonder how many mboxes out there do not have the fomat:

^From\s[some text]\s\s[Date]$

@gpoo
Copy link
Author

gpoo commented Mar 18, 2016

I took all my compressed files to compare false positives.

$ find .mlstats/compressed -type f | wc -l
16194
$ find .mlstats/compressed -type f -exec zgrep '^From ' {} \; > from-sample.txt

From some reason I did not investigate, zgrep's output included some lines without '^From', so I cleaned them up:

$ egrep '^From ' from-sample.txt > from-sample-s.txt
$ wc  -l from-sample-s.txt
2604383 from-sample-s.txt

From there, I explored which lines did not end on a year and did not have a time. I removed the ones with @, nobody, MAILER-DAEMON and ' at ', because those were repetitive. And I checked them manually. I did not find a single line that would not be a beginning of email.

$ egrep '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | egrep -v '(@|nobody|MAILER-DAEMON| at )' | wc -l
454

So, I could say, that any line that does not match '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' is text. So, the next thing to check is how many false positives mbox will give:

$ egrep -v '[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' from-sample-s.txt | wc -l
3186

Not that many, to be honest. But those would be the double, because it will truncate messages where these lines belong to.

Well, I narrowed the list to check it manually and I put them in a file errors.txt. I am pasting some of them:

$ sort errors.txt | uniq -c | sort -n
      6 From http://www.osnews.com/story.php?news_id=421&page=5 :
      7 From the prior art department: compare with RFC 2026.
      8 From 10.81.104.254 icmp_seq=1 Frag needed and DF set (mtu = 1450)
     10 From config.log:
     10 From the changelog:
     12 From Hagakure, The Book of the Samurai
     15 From 1 to the square root of 3        |    directly, unless asked |
     47 From Darkness, Lead me towards the Light,
     47 From Death, Lead me to Life Eternal
     47 From Untruth, lead me to the Truth,
     54 From WE to YOU, with love:

That said:

  1. If the line that does not match '^From .*[0-9]{2}:[0-9]{2}:[0-9]{2}.*[0-9]{4}$' is potentially just text.
  2. To reduce the odds, the second line should not look like a header (something like '^[A-Za-z\-]: ')

You can check with your own datasets.

@sduenas
Copy link
Member

sduenas commented Mar 21, 2016

Thanks @gpoo. I will try the same with some sets.

In the meantime, do you know any other good mbox parser in other languages, like C, Java, Perl whatever? I'm curious to know how they fix this issue because I think it's a problem more related to the mbox format itself than to the parser. If someone fixed this in other languages we can apply the same techniques or even send a patch to the standard library to fix it.

valeriocos pushed a commit to valeriocos/perceval that referenced this issue Dec 6, 2017
This patch adds the to_unicode() function to avoid encoding
problems related to ascii and unicode conversions. This function
is required by uuid() function.

Fixes chaoss#20
valeriocos pushed a commit to valeriocos/perceval that referenced this issue Dec 6, 2017
This patch sets the system encoding to UTF-8 using setdefaultencoding()
from sys module. It is a little hacky but really useful while we
moving the code to Python 3.

Fixes chaoss#20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants