Python MongoDB notes
MongoDB is a popular new schemaless, document-oriented, NoSQL database. It is useful for logging and real-time analytics. I'm working on a tool to store log files from multiple remote hosts in MongoDB, then analyze them in real time and print pretty plots. My work in progress is located on GitHub.
Here are my first steps using PyMongo. I store an Apache access log in MongoDB and then query it for the number of requests in the last minute. I am running on Ubuntu Karmic 32-bit (though MongoDB really wants a 64-bit system, since 32-bit builds are limited to about 2 GB of data).
Install and run MongoDB
- Download and install MongoDB (Reference)
cd ~/lib
curl http://downloads.mongodb.org/linux/mongodb-linux-i686-latest.tgz | tar zx
ln -s mongodb-linux-i686-2010-02-22 mongodb
- Create data directory
mkdir -p ~/var/mongodb/db
- Run MongoDB (Reference)
~/lib/mongodb/bin/mongod --dbpath ~/var/mongodb/db
Install PyMongo
- Install pip
- Install PyMongo (Reference)
sudo pip install pymongo
Simple Example
writer.py:
import re
from datetime import datetime
from subprocess import Popen, PIPE, STDOUT

# Connection is the client class in older PyMongo releases; newer ones use MongoClient
from pymongo import Connection
from pymongo.errors import CollectionInvalid

HOST = 'us-apa1'
LOG_PATH = '/var/log/apache2/http-mydomain.com-access.log'
DB_NAME = 'mydb'
COLLECTION_NAME = 'apache_access'
MAX_COLLECTION_SIZE = 5 # in megabytes


def main():
    # connect to mongodb
    mongo_conn = Connection()
    mongo_db = mongo_conn[DB_NAME]
    # create a capped (fixed-size) collection, or reuse it if it already exists
    try:
        mongo_coll = mongo_db.create_collection(COLLECTION_NAME,
                                                capped=True,
                                                size=MAX_COLLECTION_SIZE*1048576)
    except CollectionInvalid:
        mongo_coll = mongo_db[COLLECTION_NAME]

    # open remote log file
    cmd = 'ssh -f %s tail -f %s' % (HOST, LOG_PATH)
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=STDOUT)

    # parse and store data
    while True:
        line = p.stdout.readline()
        if not line:
            break  # the remote tail stream closed
        data = parse_line(line)
        if not data:
            continue  # skip lines that do not match the expected format
        data['time'] = convert_time(data['time'])
        mongo_coll.insert(data)


def parse_line(line):
    """Apache combined log format
    %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"
    """
    m = re.search(' '.join([
        r'(?P<host>(\d+\.){3}\d+)',
        r'.*',
        r'\[(?P<time>[^\]]+)\]',
        r'"\S+ (?P<url>\S+)',
    ]), line)
    if m:
        return m.groupdict()
    else:
        return {}


def convert_time(time_str):
    # drop the UTC offset (e.g. " -0800" or " +0100"); the result is a
    # naive datetime in the web server's local time
    time_str = re.sub(r' [+-]\d{4}', '', time_str)
    return datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S")


if __name__ == '__main__':
    main()
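To see what parse_line extracts, here is a minimal standalone sketch that runs the same regular expression against a single log line, with no MongoDB or ssh involved. The sample line and its address 192.0.2.10 are made up for illustration. convert_time_utc is a hypothetical variant, not part of writer.py: it keeps the UTC offset instead of discarding it, and it assumes Python 3 and an English locale for the month name.

```python
import re
from datetime import datetime, timezone

# Same pattern as parse_line in writer.py: client IP, timestamp, request URL.
LOG_PATTERN = re.compile(' '.join([
    r'(?P<host>(\d+\.){3}\d+)',
    r'.*',
    r'\[(?P<time>[^\]]+)\]',
    r'"\S+ (?P<url>\S+)',
]))

# Hypothetical sample line in Apache combined log format.
SAMPLE_LINE = ('192.0.2.10 - - [22/Feb/2010:14:03:27 -0800] '
               '"GET /index.html HTTP/1.1" 200 512 "-" "curl/7.19.5"')


def parse_line(line):
    """Return the host, time and url fields, or an empty dict if no match."""
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else {}


def convert_time_utc(time_str):
    """Parse an Apache timestamp, keeping its offset, and convert it to UTC."""
    aware = datetime.strptime(time_str, '%d/%b/%Y:%H:%M:%S %z')
    return aware.astimezone(timezone.utc)


fields = parse_line(SAMPLE_LINE)
print(fields['host'])                    # 192.0.2.10
print(fields['url'])                     # /index.html
print(convert_time_utc(fields['time']))  # 2010-02-22 22:03:27+00:00
print(parse_line('not a log line'))      # {}
```

Storing UTC rather than the server's local time is one possible design choice when logs come from hosts in different time zones; writer.py itself stores the local time as it appears in the log.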
reader.py:
import time
from datetime import datetime, timedelta

from pymongo import Connection

DB_NAME = 'mydb'
COLLECTION_NAME = 'apache_access'


def main():
    # connect to mongodb
    mongo_conn = Connection()
    mongo_db = mongo_conn[DB_NAME]
    mongo_coll = mongo_db[COLLECTION_NAME]

    # find the number of requests in the last minute
    while True:
        d = datetime.now() - timedelta(seconds=60)
        N_requests = mongo_coll.find({'time': {'$gt': d}}).count()
        print('Requests in the last minute: %d' % N_requests)
        time.sleep(2)


if __name__ == '__main__':
    main()
Running python writer.py in one terminal and python reader.py in another terminal, I get the following results:
Requests in the last minute: 13
Requests in the last minute: 14
Requests in the last minute: 14
Requests in the last minute: 14
Requests in the last minute: 13
Requests in the last minute: 14
Requests in the last minute: 15
...
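The query in reader.py counts documents whose time field is later than a cutoff 60 seconds in the past. The same filter can be sketched in plain Python, without a database, to make the window semantics explicit. The function name count_recent and the sample timestamps are hypothetical; the comparison mirrors the strict $gt operator.

```python
from datetime import datetime, timedelta


def count_recent(timestamps, now, window_seconds=60):
    """Count timestamps strictly later than now - window_seconds.

    Mirrors the filter {'time': {'$gt': cutoff}} used in reader.py.
    """
    cutoff = now - timedelta(seconds=window_seconds)
    return sum(1 for t in timestamps if t > cutoff)


# Hypothetical request times, given as seconds before `now`.
now = datetime(2010, 2, 22, 14, 5, 0)
requests = [now - timedelta(seconds=s) for s in (5, 30, 59, 60, 61, 300)]

# Only the requests 5, 30 and 59 seconds old fall inside the window;
# the one exactly 60 seconds old is excluded by the strict comparison.
print('Requests in the last minute:', count_recent(requests, now))  # 3
```

Because writer.py stores the web server's local time without an offset and reader.py compares against datetime.now() on its own machine, the count is only meaningful when both clocks agree and share a time zone.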




