[Notebooks Enhancement] Improve intuitiveness and performance of schema exploration #384
Comments
Adding to the new home of this issue for clarification on some of my language. I said that "getting back all those unique values will take time". What I meant there is that if you ask for too much data at once, it'll be slow (obvious, I know). So you should be scoping by time and then limiting the number of results. There's another thing that IOx can provide that I realize now might not be obvious (and may not be useful, but here it is). From the perspective of query functionality, there is no difference in IOx between a tag and a field. So that means you can do all the same operations on either. You can group by, filter results, look for unique values, limit on time, etc. I'm not sure if that changes any of your thinking on the UI, but I thought it was worth mentioning.
@pauldix Yep, I was aware of that and that went into my thinking on it. I think the only thing this changes is how we talk about it (i.e., tooltips/documentation) and possibly how Proposal 2 would get implemented (since there is no idea of Fields mapped to Measurements, specifically). Out of curiosity, will there be any way for the storage engine to distinguish between what is a "metric" and the metadata that describes the metric (what we call Field and Tag, respectively, now)? In the docs I've read so far about IOx, there is still some distinction between "Tagset" and "Field"....but that could have changed. From the perspective of the user, I think what matters most is keys. Taking the most extreme obstacle we can come across (user knowing nothing about the data model), a user might want The only issue I see is if they wrote Fields with names like "disk_used_percent" and have thousands of data sources that emit that. The filter will display back thousands of matches so I think we'd need to find a way to suspend distinguishing those Field keys by their metadata (Tags) and only by Measurement/Table.
There are a few things that are happening right now that are different between tag and field, but they're conveniences for InfluxDB compatibility/ease of use. The in-memory write buffer database automatically dictionary encodes tags to save space for any repeated values. String fields are currently kept as is, so if you have a string field with a bunch of repeated values, they'll take up more in-memory space than tags. However, once something gets down to Parquet and the Segment Store (optimized in-memory DB) there's literally no difference. Data will land in one or both of these places after it has been buffered in the previously mentioned write buffer DB. Periodically it will be flushed out to the read-only Segment or Parquet formats. The optimal encoding and compression for those will be chosen based on the shape of the data for each column (tag or field). That being said, we'll likely store metadata that specifies which columns were written in as InfluxDB tags and which as fields. However, there's no requirement that mapping exists. Sometime soon, we'll have another way to write data into IOx, which will just be flat JSON objects, which will make no distinction between tag or field.
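To illustrate why repeated tag values cost less memory than repeated string field values in the write buffer, here is a minimal Python sketch of dictionary encoding. It is not IOx's implementation; the function names `dictionary_encode` and `dictionary_decode` and the sample host values are made up for illustration.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes); codes[i] is the index of values[i] in dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            # First time we see this string: store it once and remember its slot.
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the integer codes."""
    return [dictionary[code] for code in codes]


# Hypothetical tag column with repeated values.
hosts = ["host01", "host02", "host01", "host01", "host02"]
dictionary, codes = dictionary_encode(hosts)
```

Each distinct string is stored once and every row holds only a small integer, which is where the space saving for repeated values comes from.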
I echo what Paul is saying -- namely that the design of the underlying treatment for tags and fields is the same in IOx -- they are both columns. Tags will always be String given the Line protocol data model, and Fields can be Strings, Ints, Floats, or Bool. IOx will be using the knowledge that a certain column holds a Tag as a way to optimize the actual storage of these things (e.g. by using a Dictionary encoding for tags by default). One thing I don't fully understand about proposal 2 above is the use of tag value orders. For example, why is it
and not like this (swapping host01 and sda0)?
While exploring in the UI you typically drill down one tag at a time, but the underlying data model doesn't really have any parent/child relationships between tags.
@alamb Totally -- I believe I mentioned it but tag order does not matter. I wanted to explicitly state that so if anyone were to eventually implement this, they wouldn't need to worry about that. The main relationship that matters is Fields-->Measurements. Not only does the Tag order not matter but the Tags themselves will not matter much when exploring data. The only reason I included them in the example was really to catch any cases where the user might want to see that context. For instance, I could imagine a user would be interested in knowing which metrics are available for a specific host (say, they're doing root cause analysis and they're coming from having reviewed logs for a particular host). In other words, I wanted to include the possibility of the case where a user types or copy/pastes some UUID to see what they can get out of that.
I think the Data Explorer is awesome. I believe the community loves it as well. It only receives rave reviews. It would be fantastic if the Data Explorer could be embedded in a notebook cell. Even when the data is flat JSON objects with IOx (yay!), being able to search through the data in the Data Explorer is so easy, as the order of filtering for fields and tags doesn't matter with the Data Explorer.
@samhld commented on Wed Dec 02 2020
Current behavior:
The data exploration ("Select a metric") cell type defaults to showing you a comprehensive list of all possible column values. This looks like the below gif (compared with the experience in the original schema explorer). Idk why it came out so slow:
Problem:
What we know:
[Image: Screen Shot 2020-10-30 at 1 23 29 PM]
The Influx data model follows a logical form similar to the below:
From the perspective of a user and their access patterns, a query logically filters first by Measurement and then begins to seek matching Tags/columns. Similar to a SQL query, a user first determines which Table they will need to find their desired data. A flat representation of this concept seems likely to confuse users.
The expectation of this feature is that the user knows which metrics they want to select coming into this.
If users don't know anything about the Influx data model, they will need assistance with knowing:
Proposal 1:
Use the original Data Explorer for this cell. It is intuitive for users (I hear this a lot).
I know a known issue is that its underlying queries were expensive....but is that improved with this flat model? The above gif suggests otherwise.
Also, I've heard talks about caching this information (which we absolutely should do) anyways.
Proposal 2:
Keep the basic idea of the current representation but manifest series keys as dot-notated keys that indicate relationships similar to the Line Protocol diagram above. Examples:
disk.us-west.host01.sda0.used_percent
system.us-east.host1002.disk_used
In the above series keys, the first line segment (top) is always the Measurement name. Every segment after that and before the last are Tags and order does not matter (this is not a document-/directory-style relationship after all). The last segment is the Field/"metric". The Tags are less meaningful when doing data exploration so they can be "shortened" with ellipses or some other notation...if/when needed. Perhaps they could be omitted entirely (I'll think on that more).
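A minimal Python sketch of how such dot-notated keys could be parsed and compared, assuming no segment contains a dot itself (real Measurement names or Tag values might, so an actual implementation would need escaping). The names `SeriesKey`, `parse_key`, `format_key`, and `shorten`, as well as the `[...]` shortening notation, are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple


class SeriesKey(NamedTuple):
    measurement: str
    tags: FrozenSet[str]  # unordered on purpose: tag order carries no meaning
    field: str


def parse_key(key: str) -> SeriesKey:
    """Split a dot-notated key into measurement, tag segments, and field."""
    parts = key.split(".")
    if len(parts) < 2:
        raise ValueError(f"need at least a measurement and a field: {key!r}")
    return SeriesKey(parts[0], frozenset(parts[1:-1]), parts[-1])


def format_key(measurement: str, tags: Iterable[str], field: str) -> str:
    """Build a key; tags are sorted only so the output is deterministic."""
    return ".".join([measurement, *sorted(tags), field])


def shorten(key: str) -> str:
    """Collapse the tag segments, keeping only measurement and field."""
    parsed = parse_key(key)
    if not parsed.tags:
        return f"{parsed.measurement}.{parsed.field}"
    return f"{parsed.measurement}.[...].{parsed.field}"


a = parse_key("disk.us-west.host01.sda0.used_percent")
b = parse_key("disk.sda0.us-west.host01.used_percent")  # same tags, different order
```

Because the tag segments are held in a set, `a == b` holds even though the two strings list the tags in a different order.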
The benefit of this is there is an implied relationship between the objects. I believe this makes more sense to a user whether or not they are familiar with the Influx data model. The information missing would be what the line segments mean (easily tool-tipped) but the current flat representation has the same problem manifested slightly differently...and again, more confusingly IMO.
One of the reasons for the current flat representation was that users don't necessarily know the difference between Measurements and Fields, so we needed a way to display all options when someone filtered for, say, `disk`. Does the user want a table with a bunch of `disk` metrics or simply a `disk` metric...and do they know that upfront? This dot-notation also resolves that issue by displaying all the options but with more context.
Proposal 3:
Don't display Tag values, just keys. This provides the user the ability to filter by Measurement name and other metadata. From there, they get a result set of Fields (values to the `_field` Tag, in this case). This is sort of a hybrid approach to the current state and the original Data Explorer depicted on the right side of the gif at the top of the issue.
It, along with the first two proposals, satisfies the bullets in the "What we know" section as well as resolves the issues pointed to earlier.
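Proposal 3 could be sketched roughly as follows in Python, assuming a small in-memory catalog that records, for each Field, its Measurement and the Tag keys written alongside it. The catalog, the `Column` type, the `matching_fields` function, and the Tag key names (`region`, `host`, `device`, `cpu`) are all hypothetical; nothing here reflects how the UI or the storage engine actually stores schema.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Column(NamedTuple):
    measurement: str
    tag_keys: FrozenSet[str]  # keys only; Tag values are never listed
    field: str


# Hypothetical schema catalog.
CATALOG = [
    Column("disk", frozenset({"region", "host", "device"}), "used_percent"),
    Column("disk", frozenset({"region", "host", "device"}), "free"),
    Column("system", frozenset({"region", "host"}), "disk_used"),
    Column("cpu", frozenset({"region", "host", "cpu"}), "usage_idle"),
]


def matching_fields(
    catalog: Iterable[Column],
    measurement: Optional[str] = None,
    tag_keys: Iterable[str] = (),
) -> List[str]:
    """Return the sorted Field names that match the Measurement and Tag-key filters."""
    wanted = frozenset(tag_keys)
    return sorted(
        {
            column.field
            for column in catalog
            if (measurement is None or column.measurement == measurement)
            and wanted <= column.tag_keys  # every requested key must be present
        }
    )


disk_fields = matching_fields(CATALOG, measurement="disk")
device_fields = matching_fields(CATALOG, tag_keys=["device"])
```

Since only keys are filtered on, the size of the result depends on the number of Fields rather than on Tag value cardinality.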
@rbetts commented on Wed Dec 02 2020
As we think about metadata presentation in flows, we also need to remember that we need a system for arbitrary cardinality. In the future, a user with a billion series and 100,000 measurements should be able to discover, understand, and navigate their data.
@pauldix commented on Wed Dec 02 2020
Given the underlying design of IOx, I think the number of measurements will be significantly less than 100k. Likely a few thousand at the most, otherwise the user should maybe be rethinking their schema design. I'm guessing that having 100k measurements will result in poor performance (although who knows).
Tag value cardinalities will effectively be unbounded, but getting back all those unique values will take time. Generally, when looking for tag values, it should be scoped by some time period, which is flexible, and then use `LIMIT` and `OFFSET` to paginate through results.
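A minimal sketch of time-scoped, paginated tag value lookup. It uses Python's built-in `sqlite3` as a stand-in, since the thread does not show IOx's query interface; the `cpu` table, its columns, the sample rows, and the `tag_value_page` function are assumptions made for illustration.

```python
import sqlite3
from typing import List


def tag_value_page(
    conn: sqlite3.Connection, start: int, stop: int, limit: int, offset: int
) -> List[str]:
    """Return one page of distinct host values seen in the window [start, stop)."""
    rows = conn.execute(
        "SELECT DISTINCT host FROM cpu "
        "WHERE time >= ? AND time < ? "
        "ORDER BY host "  # a stable order is needed for OFFSET paging to be consistent
        "LIMIT ? OFFSET ?",
        (start, stop, limit, offset),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (time INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (10, "host03", 0.1),
        (20, "host01", 0.2),
        (30, "host02", 0.3),
        (40, "host01", 0.4),
        (500, "host99", 0.5),  # falls outside the time window queried below
    ],
)

page1 = tag_value_page(conn, start=0, stop=100, limit=2, offset=0)
page2 = tag_value_page(conn, start=0, stop=100, limit=2, offset=2)
```

The time predicate bounds how much data is scanned, and `LIMIT`/`OFFSET` bound how much is returned per request, so the UI never asks for every unique value at once.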