Sunday, March 06, 2011

Book Review: Python 2.6 Text Processing



Python is a powerful, dynamic programming language used in a wide variety of application domains, such as web and internet development, database access, desktop GUIs, scientific and numeric computing, education, network programming, software development, and games and 3D graphics.
As a security analyst I'm always interested in ways to better query vast quantities of text such as parsing web server logs for various signs of evil.
Jeff McNeil's Python 2.6 Text Processing Beginner's Guide from Packt Publishing struck me as a useful resource with which to improve Python skills specific to text processing.
This book is intended for novice Python developers interested in processing text (me), and is laid out and written so as to be very supportive of this cause.
First published in December 2010, Python 2.6 Text Processing is organized via these conventions:
  • Time for action - inclusive of multiple instructions followed by extra detail and explanation (What just happened?)
  • Pop quiz - to help you test your understanding of methods just discussed
  • Have a go hero - practical challenges to put your learning to use
I appreciate the logical flow of the book, moving from basic concepts and IO handling, to string services and standard library usage, to regular expressions, structured markup, encoding, and advanced output.
With my interest in web server log manipulation, I found myself able to quickly embrace the concepts and make use of the book's code samples, which are available on the Packt Publishing website.
Anyone who operates a website and spends any time reviewing web logs is likely aware that a certain percentage of all traffic bound for their site is malicious, be it uniquely targeted or bot traffic crawling by looking for weak spots.
One such example is remote file include (RFI) attempts. I've been using a Perl script to parse my logs for such traffic, but have wanted to use that analysis as an opportunity to learn Python and ultimately rewrite the scripts in Python. While I haven't gotten there yet, I am certain this book will help me do so.
Also useful is that Python 2.6 Text Processing points to additional resources such as documentation and APIs, community resources such as mailing lists and conferences, and a discussion of Python 3 and what to expect when migrating.
Returning to the RFI analysis mentioned above, I used Python to pull interesting, related results out of my web logs.
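To make the idea concrete, here is a minimal sketch of the kind of check involved. It is not my Perl script and not code from the book. It assumes Apache-style common/combined log lines and uses a deliberately simple heuristic (a query-string parameter whose value is itself a URL). The names RFI_PATTERN, REQUEST_PATTERN, and find_rfi_attempts, along with the sample log lines, are made up for illustration, and the sketch is written for Python 3 rather than the book's Python 2.6.

```python
import re
from urllib.parse import unquote

# Hypothetical heuristic: a query parameter whose value starts with a URL scheme.
RFI_PATTERN = re.compile(r"[?&][^=&\s]+=\s*(?:https?|ftp)://", re.IGNORECASE)

# Request portion of a common/combined log format line: "METHOD path PROTOCOL"
REQUEST_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]*)"'
)


def find_rfi_attempts(lines):
    """Yield (line_number, decoded_path) for requests that look like RFI attempts."""
    for number, line in enumerate(lines, start=1):
        match = REQUEST_PATTERN.search(line)
        if not match:
            continue  # not a recognizable request line
        # Decode %xx escapes so that encoded URLs are caught as well.
        path = unquote(match.group("path"))
        if RFI_PATTERN.search(path):
            yield number, path


# Fabricated sample lines, for illustration only.
sample_log = [
    '192.0.2.10 - - [06/Mar/2011:10:00:00 -0700] '
    '"GET /index.php?page=http://example.com/shell.txt? HTTP/1.1" 404 512',
    '192.0.2.11 - - [06/Mar/2011:10:00:05 -0700] '
    '"GET /about.html HTTP/1.1" 200 2048',
    '192.0.2.12 - - [06/Mar/2011:10:00:09 -0700] '
    '"GET /view.php?inc=ftp%3A%2F%2Fexample.net%2Fx.txt HTTP/1.1" 404 512',
]

hits = list(find_rfi_attempts(sample_log))
for number, path in hits:
    print(number, path)
```

A heuristic this loose will flag legitimate redirect parameters too, so treat the hits as leads to review rather than confirmed attacks.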
While Chapter 2 of Python 2.6 Text Processing introduces a web server log parser and builds on it throughout the chapter, I was drawn to searching and indexing as described in Chapter 11 via the use of the Nucular libraries (no, not the Bush mispronunciation).
"Nucular is a system for creating full text indices for fielded data. It can be accessed via a Python API or via a suite of command line interfaces."
First, ensure that you've installed the SetupTools easy_install system via python ez_setup.py, as discussed on page 23. Once it is installed, issue easy_install nucular, and the libraries and related dependencies will be installed to the appropriate paths.
With some modifications to the provided code samples, I then created an index of three years' worth of web logs from my site, and was able to query them as a single source for keywords indicative of RFI attacks. While I started with a simple linear search across multiple logs via text_scan.py, as seen on page 302, I quickly learned why McNeil portrays the linear search method as laborious and ineffective, instead promoting the use of libraries such as Nucular, and he's right.
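The difference between the two approaches can be sketched in a few lines. This is neither the book's text_scan.py nor the Nucular API. It is a toy in-memory inverted index, written for Python 3, with hypothetical function names and fabricated sample lines, meant only to show why paying the indexing cost once beats rescanning every log on every query.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def linear_search(lines, keyword):
    """Re-read and re-tokenize every line on every query."""
    keyword = keyword.lower()
    return [i for i, line in enumerate(lines) if keyword in tokenize(line)]


def build_index(lines):
    """Build once: map each token to the set of line numbers containing it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(i)
    return index


def indexed_search(index, keyword):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(keyword.lower(), set()))


# Fabricated sample lines, for illustration only.
log_lines = [
    "GET /index.php?page=http://example.com/shell.txt",
    "GET /about.html",
    "GET /view.php?inc=http://example.net/cmd.txt",
]

index = build_index(log_lines)
print(linear_search(log_lines, "http"))
print(indexed_search(index, "http"))
```

Both calls return the same line numbers. The linear version does work proportional to the total size of the logs for each keyword, while the indexed version does that work once up front. A library such as Nucular additionally persists the index to disk and handles fielded data, which this sketch does not attempt.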
Overall, this book is an effective learning tool, though keep in mind that it's entirely Linux-centric. Syntax for those of you using Python on Windows is subject to nuances.
McNeil's done a solid job with Python 2.6 Text Processing Beginner's Guide; it's a verbose (sometimes he turns on the fire hose) but worthy read and a suggested purchase at $45 +/- direct from Packt, Amazon, or Barnes and Noble, earning 3.5 stars out of 5 (very good).
Give it a read and put those mad new Python skills to good use.
Cheers.
