TweeQL makes interacting with the Twitter API as easy as SQL makes it to interact with relational data!
For example, the TweeQL query SELECT brand(text) AS brand, sentiment(text) AS sentiment FROM twitter_sample;
will run two
user-defined functions, brand
and sentiment
to extract brand names and
expressed user sentiment from the text of every tweet that passes through
the Twitter Streaming API. All this, without having to worry about the details
of implementing the API correctly.
Mailing List: Google Group---http://groups.google.com/group/tweeql
IRC: freenode.net #tweeql (web interface)
There are two options for installing TweeQL:
- easy_install/pip. Something like
sudo easy_install tweeql
orsudo pip install tweeql
at the command line should suffice. - from github. After checking out this repository, run
sudo python setup.py install
to install! Note: if you do this in Ubuntu/Debian, typesudo apt-get install python-setuptools
first.
Installing Python in Windows:
- Download and Install pyreadline using a suitable installer from here: https://pypi.python.org/pypi/pyreadline/2.0
- run
python setup.py install
on the windows command line.
TweeQL requires a settings.py
file in your current working directory before
you can use it. The simplest way to get a working settings.py
file is to get the template into your current directory:
curl https://raw.github.com/marcua/tweeql/master/settings.py.template > settings.py
This file should work without editing. You can edit it to provide a username/password for the streaming API, or to change the database into which extracted data will be inserted. Edit settings.py to include the username of the account you created at GeoNames.
After installing TweeQL, you should be able to run the TweeQL command line by typing tweeql-command-line.py
:
marcua@marcua-x60:~$ tweeql-command-line.py
Check if CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, and ACCESS_TOKEN_SECRET are defined in settings.py
Consumer key: XXX
Consumer secret:
Access token: XXX
Access token secret:
tweeql>
Once at the tweeql>
command line, you can enter SQL-like queries, which will hit the Twitter Streaming API.
To see Tweets from the Twitter Spritzer stream, type SELECT text FROM twitter_sample;
:
tweeql> SELECT text from twitter_sample;
não sei realmente o que decidir da minha vida.. muito confusa. :S
伝線した黒ストッキングの使い道は…
...
These should stream by your screen relatively quickly. Hit ctrl+c
to stop the query. To exit TweeQL, hit ctrl+c
again.
Now we'll walk through example queries to teach you the TweeQL syntax. Check out examples.txt for a longer list of example queries that you can run in TweeQL.
TweeQL borrows its query syntax from SQL, and so queries are of the form
SELECT field1, field2 FROM streams WHERE filter_conditions GROUP BY field3, field4 WINDOW x seconds
These keywords work just like in SQL, except for the WINDOW key word, which applies to GROUP BY. Because Twitter offers an infinite stream of data, aggregates which use GROUP BY would never emit any groups. WINDOW x seconds
tells TweeQL to emit the aggregates in the GROUP BY statement every x seconds, providing a rolling window over which to aggregate groups.
Our first TweeQL query was SELECT text from twitter_sample;
. TWITTER_SAMPLE
is a stream known in the Twitter API as the Spritzer stream: it is a sample of tweets that constitutes approximately 1% of Twitter's firehose. You can issue queries to TWITTER_SAMPLE
without any filter conditions, though you are free to further filter this stream.
If you wish to access more than a sample of the tweets on the stream, you have to query TWITTER
rather than TWITTER_SAMPLE
. To avoid excessive costs, Twitter requires that you issue at least one filter condition along with your query. For example, you can get all tweet text for tweets that contain the word 'obama' by issuing the following query
SELECT text FROM twitter WHERE text contains 'obama';
You can also filter the results by tracking individual users or by querying geographic regions, though this is not currently implemented in TweeQL at the moment (e-mail me if this is important to you).
Tweets have several fields which one can filter, select, or aggregate across. The fields are currently text
(tweet text), location
(user-defined location, like 'Boston' or 'France, Earth'), lang
(user-specified language), profile_image_url
(the URL of the user's profile image), user_id
(the user's Twitter userid), screen_name
(the user's username), and created_at
(the time of the tweet). You can add more by editing tweeql/twitter_fields.py
.
Fields can appear in SELECT, WHERE, or GROUP BY clauses, and can be separated by commas. To add to our last query, we might want to find the username, tweet time, and tweet text of all tweets containing the text 'obama' and have a non-NULL location:
SELECT screen_name, created_at, text FROM twitter WHERE text contains 'obama' AND location != NULL;
Printing tweets to your screen might be fun, but it's not always the desired result. TweeQL provides an INTO
keyword to faciliate dumping tweets to other locations. By default (when not specified), TweeQL sends tweets INTO STDOUT
(the screen). Alternatively, you can send tweets INTO TABLE tablename
, which will insert the tweets into a table tablename
. For example:
SELECT screen_name, text FROM twitter INTO TABLE obama_tweets WHERE text contains 'obama';
If obama_tweets
does not exist, it will be created with the schema specified by the SELECT
parameters. If the table already exists and matches the schema of the SELECT
parameters, tweets will be appended to that location.
The database which contains the table is specified in settings.py
either through the DATABASE_URI
or DATABASE_CONFIG
parameters. By default, a sqlite3 database called test.db
will be created in your current working directory.
For performance reasons, TweeQL batches records in groups of 1000 before inserting them into the database. This means that if you end the query before 1000 records are generated, you will lose those records. E-mail me if you require a more durable solution.
Tweets are hardly data on their own. Most meaningful uses of the tweet stream will require some finessing of the data that comes from the stream. As such, you can run functions that infer information or derive structure from the various fields that TweeQL offers by default. These functions can appear anywhere a field appears in a query. The functions that come with TweeQL are
strlen(val)
takesval
(a string) and returns the length of the string. Example:SELECT strlen(text) AS length FROM twitter_sample
will print the length of the text of each tweet.sentiment(val)
takes a stringval
and returns the sentiment of the text (sentiment classification). A positive value indicates positive sentiment, a negative one indicates negative sentiment, and 0 indicates neutral or no discernible sentiment. The magnitude of the value is used for aggregation---it does not indicate strength of the sentiment---and represents the inverse of the recall of the classifier for positive and negative tweets. Example:SELECT sentiment(text) AS sentiment FROM twitter_sample
will return the sentiment of tweets on the Spritzer stream.tweetLatLng("lat")/tweetLatLng("lng")
returns the latitude or longitude of the tweet. If the tweet is geotagged (e.g. by a mobile device), this precise value is used. If no geotagging information is provided, we use a geocoding service to get latitude and longitude values for the user-specifiedlocation
field. Example:SELECT tweetLatLng("lat") AS latitude, tweetLatLng("lng") AS longitude, screen_name FROM twitter_sample;
returns the latitude, longitude, and username of twitter users tweeting on the stream.floor(val, granularity)
returns the floor of a floating-point value to the granularity ofgranularity
. Example:SELECT floor(tweetLatLng("lat"), .5) AS latitude, floor(tweetLatLng("lng"), .5) AS longitude, screen_name FROM twitter_sample;
rounds the latitude and longitudes in the previous example to the multiple of .5 less than or equal to the actual value.temperatureF(val)
returns the temperature in Fahrenheit described in the stringval
. Performs a regular expression match along the lines of "...(floating-point number)°(C|F)..." which matches strings such as "...35°C..." and "...35°F...". When a Celcius temperature is matched, converts to Fahrenheit. Example:SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL;
retrieves temperatures and latitude/longitude pairs expressed on the stream.meanDevs(val, [group1, group 2,...])
returns the mean deviation of val from the group depicted by [group1, group2]. Mean deviations can act in place of standard deviations for outlier detection in streaming scenarios. During a learning phase of 5 observations per group,meanDevs
returns negative values. Example:SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL AND meanDevs(temperature, floor(tweetLatLng("lat"), 5), floor(tweetLatLng("lng"), 5)) < 2;
. To avoid pulling in temperatures from tweets such as "It's 1000000°C in here," we want to pull in tweets whose deviation from the mean within a 5-degree latitude/longitude box is less than 2. In practice, you will also want this mean deviation check to drop deviations that are negative, since the system is still in a data collection phase for that group.
You can define your own functions as well. To learn how, read the User-Defined Functions section below.
TweeQL provides SQL-like aggregates (AVG
, COUNT
, SUM
, MIN
, and MAX
). Since these functions are operating on a stream, TweeQL also provides a WINDOW keyword, which allows you to set the time window over which to calculate the aggregate values before emitting them. For example:
SELECT location, COUNT(text) AS numtweets FROM TWITTER_SAMPLE GROUP BY location WINDOW 120 seconds;
would return the location and number of tweets at that location on the Spritzer stream every 120 seconds.
Working from the TweeQL command-line is a good way to get started, but you might want to issue TweeQL queries programmatically.
Here's a code sample to start with, also available at tweeql-programmatic.py:
from tweeql.exceptions import TweeQLException
from tweeql.query_runner import QueryRunner
runner = QueryRunner()
runner.run_query("SELECT text FROM twitter_sample;", False)
QueryRunner.run_query
takes a TweeQL query and executes it. The second argument to run_query
determines whether the query runs asynchronously or not. In this case, the program blocks on the run_query
call because the second argument is False
. If you run with the second argument as True
, a new thread will execute the query. To stop an asynchronous query programmatically, use runner.stop_query()
.
The examples we have shown have used several user-defined functions (sentiment
, tweetLatLng
, etc.). If you wish to write your own UDF, you can do so in python. The following example, in which we count the length of tweets on the Spritzer stream, can be found at tweeql-udf.py:
from tweeql.exceptions import TweeQLException
from tweeql.field_descriptor import ReturnType
from tweeql.function_registry import FunctionInformation, FunctionRegistry
from tweeql.query_runner import QueryRunner
class StringLength():
return_type = ReturnType.INTEGER
@staticmethod
def factory():
return StringLength().strlen
def strlen(self, tuple_data, val):
"""
Returns the length of val, which is a string
"""
return len(val)
fr = FunctionRegistry()
fr.register("stringlength", FunctionInformation(StringLength.factory, StringLength.return_type))
runner = QueryRunner()
runner.run_query("SELECT stringlength(text) AS len FROM twitter_sample;", False)
In this example, we build the class StringLength
, which has two methods: a factory
method and an strlen
method. Stricly speaking, the only method that a UDF requires is a factory
method, which returns another method (in this case strlen
) that performs the actual computation. The only requirement for the UDF computation method is that its first argument be tuple_data
, a dictionary of data that is returned with all tweets (read more about this here).
You can include any number of arguments after tuple_data
. In our case, we only take one: the string val
which we wish to take the length of. Note that when we run the query on the last line of this example, we pass text
as an argument to stringlength
---this is the value that val
will take on.
After we have defined our UDF in StringLength
, we then register the function. To do this, we create a FunctionRegistry
, which is a singleton, and register the function as the TweeQL stringlength
function. You can see the stringlength
function in action on the last line of this example: it's a first-class citizen of our queries now!
To see more UDFs, take a look at builtin_functions.py.
tweepy---Twitter API in Python. TweeQL would not exist without Tweepy!
pyparsing---For the TweeQL parsing
sqlalchemy---For putting results into a relational database
ordereddict---For creating an LRU cache of keys (geoloc cache)
geopy---For geolocation
nltk---For sentiment analysis
TweeQL is BSD-licensed. Do what you will with it, but be nice!