This repository was archived by the owner on Mar 6, 2019. It is now read-only.

Additional Tips

joernpreuss edited this page Dec 2, 2018 · 15 revisions

Introduction

This page collects some additional twecoll examples and handy data processing tips.

Using Init Queries

Twecoll can collect tweets matching a hashtag or any other search string using the query option (-q), as follows.

$ twecoll init -q "#somehashtag"

That generates the .dat file needed for the second step, fetch. It is also possible to work with an arbitrary set of handles by passing a file you create with your favorite editor. See the Arbitrary Sets page, where I create a file of arbitrarily selected handles to explore how a given set of organizations connect with each other.

Randomized Sets

Twecoll stores its data in simple text files. If you have access to GNU/Linux utilities such as grep, sort, or head, you can perform additional data preparation steps yourself.

For example, the snippet below samples 100 handles from a large .dat file collected via init.

$ sort -R input.dat | head -n 100 >output.dat
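As an alternative, here is a minimal sketch of the same sampling step using shuf, assuming GNU coreutils is installed. The function name sample_lines is only an illustration and is not part of twecoll.

```shell
# sample_lines N FILE: print up to N lines of FILE, picked at random
# without replacement. Relies on shuf from GNU coreutils.
# (sample_lines is a hypothetical helper name, not a twecoll command.)
sample_lines() {
    shuf -n "$1" "$2"
}
```

For example, sample_lines 100 input.dat >output.dat mirrors the command above. If the file has fewer than 100 lines, shuf simply prints all of them in random order.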

Mixing Sets

Twecoll supports mixing multiple sets of handles. The following example shows how to visualize the overlap between two handles, oazanon and lemanamana.

First, let's retrieve the data.

$ twecoll init oazanon

$ twecoll fetch -c 20000 oazanon

$ twecoll init lemanamana

$ twecoll fetch -c 20000 lemanamana
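When more than two handles are involved, the init and fetch pair can be wrapped in a small loop. The sketch below only uses the two commands shown above. The function name collect and the TWECOLL variable are assumptions introduced for illustration.

```shell
# Command used to invoke twecoll. Override it (e.g. TWECOLL=echo) for a dry run.
# TWECOLL and collect are hypothetical names, not part of twecoll itself.
: "${TWECOLL:=twecoll}"

# collect COUNT HANDLE...: run init, then fetch -c COUNT, for each handle.
# Stops and returns 1 as soon as one of the commands fails.
collect() {
    count=$1
    shift
    for handle in "$@"; do
        $TWECOLL init "$handle" && $TWECOLL fetch -c "$count" "$handle" || return 1
    done
}
```

Calling collect 20000 oazanon lemanamana runs the same four commands as above, in the same order.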

Now, we can generate the combined GML file of handles from oazanon and lemanamana as follows.

$ twecoll edgelist oazanon lemanamana

Twecoll recognizes that multiple handles are passed to edgelist. In that case, a distinct shape is assigned to each handle instead of the usual up and down triangles. If edgelist is called with the strong ties switch (-s), community coloring is preserved; otherwise, a unique color is assigned to each handle and edges remain grey. The GML file records the source .dat file for each node.

In the resulting diagram, red squares represent the first handle's nodes (oazanon) and yellow circles the second handle's (lemanamana). Black circles mark handles of lemanamana which exceeded the fetch count (> 20000 friends).

Diff'ing Sets

I saved some handles on urban intelligence in a list and noticed a similar "smart cities" list from someone I follow, namely @dr_rick. How can I find the handles that are on my list but not on @dr_rick's?

First, let's retrieve the data.

$ twecoll init -m smart-cities dr_rick

$ twecoll init -m urbanintel jdevoo

Now, let's use AWK to find out the answer to the question above.

$ awk -F ',' 'NR==FNR {m[$1]++; next} !m[$1]' dr_rick.dat jdevoo.dat

This tells AWK to use a comma as the field separator and to build a map keyed by the identifiers found in the first file (dr_rick.dat). The count stored as the value is not used; only the presence of the key matters. Finally, AWK prints each line of the second file (jdevoo.dat) whose identifier $1 is not in the map.
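The same two-file idiom can be packaged as small shell functions, including the complementary case of handles present in both files. This is a sketch under the assumption stated above that the identifier is the first comma-separated field. The function names are made up for illustration, and the variant below tests membership with AWK's in operator rather than a counter.

```shell
# only_in_second FIRST SECOND: print lines of SECOND whose first
# comma-separated field never appears as a first field in FIRST.
# (Hypothetical helper names, not part of twecoll.)
only_in_second() {
    awk -F ',' 'NR==FNR { seen[$1]; next } !($1 in seen)' "$1" "$2"
}

# in_both FIRST SECOND: print lines of SECOND whose identifier
# also appears in FIRST.
in_both() {
    awk -F ',' 'NR==FNR { seen[$1]; next } $1 in seen' "$1" "$2"
}
```

With these, only_in_second dr_rick.dat jdevoo.dat answers the question above, and in_both dr_rick.dat jdevoo.dat lists the shared handles. One caveat of the NR==FNR idiom: if the first file is empty, AWK treats the second file as the first and prints nothing.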
