Nodejs Production Strategy : Async, Logging and Event loops


If you run, or plan to run, Node.js (Express) in production, there are a few common deployment mistakes worth avoiding, such as using logging packages that are not very useful or never checking whether the application is blocking the event loop.

I have tried to explain the concepts and suggest some solutions that should help you find your own in the long run.

“Node.js application servers are single-threaded deployments, and parallelism does not exist from the programmer’s perspective, as the workload is I/O bound rather than CPU bound.”


(1) Writing Async Workflows

This means that if your code does not follow async patterns, certain loops can make the whole application slow. Let me explain with an example: assume you have a web application with a few thousand active users, and every user has to be authenticated before accessing the core features. If the authentication code is not written as an async workflow, the other users have to queue behind each request, and that can be painful. Here is a list of packages that can help you write async workflows or pipelines.
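As a minimal sketch of the idea, assuming authentication involves hashing a password: Node's built-in crypto.pbkdf2() does the expensive work off the main thread and calls back when done, so other requests are not queued behind it. The helper name hashPassword and the parameters (10000 iterations, 32-byte key, sha256) are only illustrative assumptions.

```javascript
var crypto = require('crypto');

// Hypothetical helper: hashes a password without blocking the event loop.
// crypto.pbkdf2 runs in Node's thread pool and invokes the callback later.
function hashPassword(password, salt, callback) {
  crypto.pbkdf2(password, salt, 10000, 32, 'sha256', function (err, key) {
    if (err) return callback(err);
    callback(null, key.toString('hex'));
  });
}

hashPassword('secret', 'salt', function (err, hex) {
  if (err) throw err;
  console.log('hash ready, length ' + hex.length);
});
console.log('other requests can be served while the hash is computed');
```

The synchronous variant, crypto.pbkdf2Sync(), would produce the same hash but hold up every other user while it runs.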

(2) Unblock Event Loops

As mentioned, the Node.js process is single-threaded, meaning that anything that blocks the event loop blocks almost everything. Here is a good post on understanding the event loop.

Once you understand what an event loop is, you can easily see why blocking it is a serious issue. Let me explain with a use case: assume you have to parse large files as users upload them. If you did not write async functions to handle this, it would freeze the application server and you might not have a single clue about what just happened. A fantastic tool named “blocked” exists to debug this. The logic of the package is pretty simple: it measures the time difference between two points in time and reports when that difference is larger than expected.

To use in your express app :

var blocked = require('blocked');

// Function 1: runs every 500ms
setInterval(function () {
  // Some heavy computing
}, 500);

// Function 2: runs every 3000ms
setInterval(function () {
  // Some heavy computing
}, 3000);

// Reports how long the event loop was blocked
blocked(function (ms) {
  console.log('BLOCKED FOR %sms', ms | 0);
});

Another option is procmon, which gives you a UI for viewing event loop delays and more.
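The timing idea behind such a detector can be sketched with nothing but a timer. This is not the source of the “blocked” package; createLagChecker and its parameters are hypothetical names used for illustration. A check function is scheduled at a fixed interval, and any extra time between two calls is time the event loop spent busy elsewhere.

```javascript
// Hypothetical helper: returns a function meant to be called every intervalMs.
// It compares the real time between two calls with the expected interval.
function createLagChecker(intervalMs, thresholdMs, onBlocked, now) {
  now = now || Date.now;            // clock is injectable for testing
  var last = now();
  return function check() {
    var current = now();
    var lag = current - last - intervalMs; // extra delay beyond the schedule
    last = current;
    if (lag > thresholdMs) onBlocked(lag);
    return Math.max(lag, 0);
  };
}

// Usage sketch (left commented out so the snippet does not keep a process alive):
// var timer = setInterval(createLagChecker(100, 10, function (ms) {
//   console.log('BLOCKED FOR %sms', ms | 0);
// }), 100);
```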

(3) JSON Logging
It might be a good idea to log everything as JSON, since it is easy to query and, of course, you can bind events for error handling. Using something better than console.log() is a good idea because you get :

  • Built-in serializers for errors, responses and objects.
  • Control over where logs are saved and processed.
  • More parameters in the log format, including host name, process id and application name. I would recommend binding API paths to their role or function, as it helps detect what happened when.
  • Log file rotation and more.

To do that, check out node-bunyan. It is straightforward: you create a logger variable and then use it as your logging mechanism. The output is all JSON, which is awesome.
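To make the idea concrete without depending on any package, here is a rough sketch of what a JSON logger does. It is not bunyan's API; createLogger and the record fields are assumptions chosen to mirror the list above (application name, host name, process id, error serialization).

```javascript
var os = require('os');

// Hypothetical helper, not bunyan's API: emits one JSON object per log line.
function createLogger(appName, write) {
  write = write || function (line) { process.stdout.write(line + '\n'); };

  function log(level, fields, msg) {
    var record = {
      name: appName,
      hostname: os.hostname(),
      pid: process.pid,
      level: level,
      msg: msg,
      time: new Date().toISOString()
    };
    Object.keys(fields || {}).forEach(function (key) {
      var value = fields[key];
      // JSON.stringify turns an Error into {}, so serialize it by hand
      record[key] = value instanceof Error
        ? { name: value.name, message: value.message, stack: value.stack }
        : value;
    });
    write(JSON.stringify(record));
  }

  return {
    info: function (fields, msg) { log('info', fields, msg); },
    error: function (fields, msg) { log('error', fields, msg); }
  };
}

var logger = createLogger('myapp');
logger.info({ path: '/login' }, 'user authenticated');
```

Because every line is a self-contained JSON object, the output can be filtered by path, level or pid with ordinary JSON tooling.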

More production aspects coming soon!



ngCordova : Device APIs for your Hybrid mobile apps.


As I discussed in my last post, Ionic apps can be a lifesaver for JavaScript developers who want to ship production mobile apps without learning a multitude of programming languages. I am sticking with JavaScript for good here :-)

The aspect one needs to learn while developing a hybrid mobile solution (other than JavaScript) is Cordova plugins. Every mobile app needs to be efficient and smart enough to utilise the device platform, which can be essential to any mobile strategy, be it customer engagement (using push notifications) or simply uploading pictures to the system (using the camera to take and upload a picture).

Of course, Cordova is an open style development repository that can be challenging if you don’t have any device API experience. A solution to this problem is ngCordova, which is a collection of AngularJS wrappers done right for the Ionic framework. It also has the right documentation to get started and rock in production setups.

I will focus on the iOS platform for now, and it is worth mentioning that TouchID is supported here, which is just awesome. Let me walk you through the process to set up ngCordova and add a single plugin below.

  • Installing ngCordova in your Ionic app : Add it to the dependencies in “bower.json” and install it. The other way is to install it directly
$bower install ngCordova
  • Go to your “www/index.html” and add :

<script src="lib/ngCordova/dist/ng-cordova.js"></script>

<script src="cordova.js"></script>

  • Inject in Angular Module
angular.module('myApp', ['ngCordova'])
  • Wait until the device is fully loaded before using any plugin, so wrap every plugin call in $ionicPlatform.ready :

$ionicPlatform.ready(function() {$cordovaPlugin.someFunction().then(success, error);});

Now, how do you add a Cordova plugin such as TouchID? (Of course, you can add any plugin using the same process.)

  • STEP 1: Create a Ionic project and test it works

$ionic start test tabs

$cd test

$ionic platform add ios

$ionic serve

  • STEP 2: Add touchID plugin 

$cordova plugin add cordova-plugin-touchid

  • STEP 3: Add “ng-cordova.js” and inject in angular dependency as mentioned above. 
  • STEP 4 : Write testController, which calls for authentication, checks whether it was successful and, if not, receives the error and logs it to the console.

.controller("testController", function($scope, $ionicPlatform, $cordovaTouchID) {
  // Plugins are only usable once the device is ready
  $ionicPlatform.ready(function() {
    $cordovaTouchID.checkSupport().then(function() {
      $cordovaTouchID.authenticate("You must authenticate").then(function() {
        alert("The authentication was successful");
      }, function(error) {
        console.log(JSON.stringify(error));
      });
    }, function(error) {
      console.log(JSON.stringify(error));
    });
  });
});



Happy coding!


Building rockstar Hybrid mobile apps using Javascript


You are a “WEB-ONLY” Node.js developer and realise that most productive businesses are trending towards mobile only (sooner or later). OK, maybe you build responsive apps that somehow blend in and work with mobile browsers, but web standards cannot directly use the mobile APIs. This means powerful products cannot interact with users in real time through mobile APIs such as push notifications, access etc.

Well, a few months back I would have asked you to go to “Apache Cordova”, which is a rock solid platform for building hybrid apps, but it misses the creative part, i.e. easy integration with Node frameworks such as Express and a client-side MVC attachment.

Hybrid apps have HTML, CSS and JavaScript as the ruling code, yet they can behave the same as native apps.

Ionic is the answer! It packs Angular as its client-side MVC, easy integration with the Express framework and a few awesome theme options such as IONIC-MATERIAL. Of course, bringing Express to the table opens up a lot of opportunities to integrate with an existing Node app or a new architecture written in Node. On top of all this, it has a code generator with starter templates to choose from, i.e. “blank”, “sidemenu” and “tabs”. Lastly, it supports both iOS and Android builds that you can test and launch in the various app stores.

To get started, you can install node on your local system and follow the steps :
INSTALL : $npm install -g cordova ionic
CREATE APP : $ionic start app_name sidemenu/tabs/*nothing*
$cd app_name
DEVELOP : $ionic serve (this would start the app on localhost:8100).
ADD PLATFORM : $ionic platform add ios/android
BUILD IT : $ionic emulate ios/android

For setting up emulators, visit here for iOS and here for Android.

To add material design to your Ionic app, check here.

Happy hacking!


Grunt.js strategy for Node development environment


Node.js has evolved over the past few years, and for JavaScript developers it has opened a whole new set of opportunities to exploit. I have been using Node.js in production for about 2 years now, and after a learning curve I found Grunt to be extremely useful in aiding any developer during the development process.

Grunt.js is a task runner that can automate things for you.

As said, development is not only about writing clean code; it is also about maintaining a workflow for how files are handled during the build, cleaning up after the build, checking code quality, running as a continuous process with error handling, minification, compiling Less/Sass, unit testing etc. The strategy I would share, which seems to work for any development process, is :

  • Handling files during build : “grunt-contrib-copy” copies files from one folder to another. This can be useful if you want to copy files from ‘bower_components’ to ‘public/vendor’ and later perform some operation on them. “grunt-contrib-clean” cleans up all the paths and files that you created during the build process.
  • Compiling Less/Sass : “grunt-contrib-less” and “grunt-contrib-sass” compile CSS for you from Less/Sass files.
  • Minification : “grunt-contrib-uglify” minifies JavaScript, producing a *.min.js, and “grunt-contrib-cssmin” minifies your CSS, whether compiled or from bower packages.
  • Watching : “grunt-contrib-watch” watches specific folders and can trigger a server restart etc.
  • Continuous process : “grunt-contrib-nodemon” launches nodemon from Grunt.
  • Parallel tasks : “grunt-concurrent” is an awesome package that runs many tasks in parallel.
  • Run only for modified code : “grunt-newer” runs Grunt tasks only on modified files.
  • Code quality : “grunt-contrib-jshint” validates your code by running JSHint from Grunt.
  • Testing : “grunt-mocha-test” runs server-side Mocha tests from Grunt.

Of course, you can register a new task that combines several of the tasks from this strategy. An example is shown below.

grunt.registerTask('default', ['copy:vendor', 'newer:uglify', 'newer:less', 'concurrent']);

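General syntax should be :

module.exports = function(grunt) {

  grunt.initConfig({
    task1: {
      //specific to package
    },
    task2: {
      //specific to package
    }
  });

  // Load each plugin with grunt.loadNpmTasks() before registering tasks
  grunt.registerTask('default', ['task1', 'task2']);

};
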


Why I choose to code in Javascript?


JavaScript was created at Netscape, originally under the name LiveScript. Its potential has always been misunderstood because it is seen as a “scripting” language rather than a programming language. Still, it has been around for many years and has gone through many transitions in execution and implementation.

It offers solutions to many problems that I have tried to sum and share for you.

They are as follows:

  • Solution 1 : Unify Client side and Server side code 

There have been many attempts to do this in the past, but they were never accepted by the developer community and most of them failed miserably. Yet this is an important aspect, required to power scalability, acceptance and sync among developer teams. Of course it is important code-wise as well, because one can reuse various components and resources.

Tip : Angular/Backbone and NodeJS/Express all work on the same JavaScript principles, yet serve client-side and server-side duties respectively when deployed.

  • Solution 2: Promoting the Non-Blocking programming 

JavaScript on the stack follows an async architecture, which is easy to understand if you have a clear understanding of AJAX. In async, let’s say two events (one complex, one simple) have to execute; the output of the simple event is generated first, because the complex one takes time to execute, i.e. it does not halt the process. This makes the system extremely fast. Another aspect is the built-in callbacks and event loop in the language. OK, callbacks are not that tough to do in most other languages, but here they are built in and bound to the I/O event loop, making it easy for developers to write event-driven functions that do callbacks as part of built-in functions.

Let me explain this with example :

Consider the two snippets below. They fetch the same data, but the difference is : the first one follows a sync architecture whereas the other follows an async architecture.

function getUser(id) {
  var user = db.query(id);
  return user;
}

console.log('Name: ' + getUser(432).name);

The function blocks until the database call is completed. This means the server is doing nothing but waiting until the function completes, ignoring other pending requests.

function getUser(id, callback) {
  db.query(id, callback);
}

getUser(432, function (user) {
  console.log(user.name);
});
console.log('Done');

The function will first show “Done” and then show the username that is fetched from the database.
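The ordering can be tried out without a real database. In this sketch the db object is a hypothetical in-memory stand-in (the user record and the use of setImmediate to mimic I/O are assumptions), and events records the order in which things happen.

```javascript
// Hypothetical in-memory stand-in for the db object used in the snippets.
var db = {
  users: { 432: { name: 'Alice' } },
  query: function (id, callback) {
    var users = this.users;
    // setImmediate defers the callback, the way real I/O would
    setImmediate(function () { callback(users[id]); });
  }
};

var events = [];

function getUser(id, callback) {
  db.query(id, callback);
}

getUser(432, function (user) {
  events.push('Name: ' + user.name);
  console.log(events.join(', ')); // prints "Done, Name: Alice"
});
events.push('Done');
```

“Done” is recorded first because getUser() returns immediately; the callback only runs once the current code has finished and the event loop picks it up.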

A much better example of non-blocking programming can be seen in this code, where callbacks are used throughout :

function add(list, title, position, callback) {
  // Add new item to 'items' store
  db.insert('items', { list: list, title: title }, function (item) {
    // Set position if requested
    if (position !== null) {
      db.get('items.sort', list, function (sort) {
        addToListAt(sort, item.id, position);
        db.update('items.sort', sort, function () {
          finish(item.id);
        });
      });
    }
    else {
      finish(item.id);
    }
  });
  function finish(id) {
    // Perform final processing
    callback({ id: id, status: 'ok', time: (new Date()).getTime() });
  }
}

The if/else keeps the function simple. The skill to learn here is that one needs to think in this style to write good JavaScript code, i.e. think about functions as events handled by the event loop. This style is a bit tricky to pick up and even trickier to debug.

  • Solution 3 : Fast prototyping 

Object-Oriented Programming without classes (and without endless hierarchies of classes) allows for fast development (create objects, add methods, and use them). This reduces refactoring time during maintenance tasks by allowing the programmer to modify instances of objects instead of classes. This speed and flexibility paves the way for rapid development.
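A small sketch of what this looks like in practice (the object and method names are made up for illustration): objects are created from a plain prototype object, and a single instance can be extended without touching anything else.

```javascript
// A plain object used as a prototype; no class declaration is needed.
var userProto = {
  greet: function () { return 'Hello, ' + this.name; }
};

var alice = Object.create(userProto);
alice.name = 'Alice';

// Add a method to this one instance only
alice.shout = function () { return this.greet().toUpperCase() + '!'; };

var bob = Object.create(userProto);
bob.name = 'Bob';

console.log(alice.shout());      // HELLO, ALICE!
console.log(bob.greet());        // Hello, Bob
console.log(typeof bob.shout);   // undefined
```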

  • Solution 4: Functional Programming

It’s easy to create a library of reusable functions that can play an important role in multiple components, and it’s very easy to integrate such a library into new projects. In functional programming, the output of a function depends only on its arguments, not on outside state.
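As an illustration (the function names here are hypothetical), small functions that depend only on their arguments can be combined into new ones and reused across projects:

```javascript
// compose(f, g) returns a new function that applies g first, then f.
function compose(f, g) {
  return function (x) { return f(g(x)); };
}

// Two small pure functions: the result depends only on the argument.
function trim(s) { return s.trim(); }
function toSlug(s) { return s.toLowerCase().replace(/\s+/g, '-'); }

var slugify = compose(toSlug, trim);

console.log(slugify('  Hello World ')); // hello-world
```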






Sharing simple file upload snippet in Javascript


File upload is an important part of any web application. Be it a simple image album or a complex enterprise system, it is required everywhere.

Facts : Since the application revolves around REST, let’s get the verbs clear :

  • Node.js doesn’t allow you to peek into directories and download files from them; these directories are generally private unless you expose one as a public folder.
  • In any NoSQL store, you can create many collections. Let’s say every collection corresponds to a specific collection URI.
  • When the URI addresses a collection :
  1. GET : List the members of the collection, complete with their member URIs for further navigation.
  2. PUT : Replace the entire collection with another collection.
  3. POST : Create a new entry in the collection, where the ID is assigned by the collection.
  4. DELETE : Delete the entire collection.
  • When the URI addresses a member of a collection :
  1. GET : Retrieve a representation of the addressed member of the collection, expressed in an appropriate MIME type.
  2. PUT : Update the addressed member of the collection, or create it with the specified ID.
  3. POST : Treat the addressed member as a collection in its own right and create a new subordinate of it.
  4. DELETE : Delete the addressed member of the collection.
  • In Node.js, if you are using an earlier version of Express, you can use the bodyParser() middleware. If you are using the latest Express version, that specific middleware has been removed.
  • In Node.js, the file information is saved in req.files.

The logic :

  • STEP 1: The index.html

<!DOCTYPE html>
<html lang="en" ng-app="APP">
<head>
<meta charset="UTF-8">
<title>Simple FileUpload Module</title>
</head>
<body>
<form method="post" action="upload" enctype="multipart/form-data">
<input type="file" name="fileUploaded">
<input type="submit">
</form>
</body>
</html>

  • STEP 2 : Define the /upload route and show the upload confirmation on the terminal

var express = require('express'); //Express Web Server
var busboy = require('connect-busboy'); //middleware for form/file upload
var path = require('path'); //used for file path
var fs = require('fs-extra'); //File System - for file manipulation

var app = express();
app.use(busboy());
app.use(express.static(path.join(__dirname, 'public')));

app.route('/upload')
  .post(function (req, res, next) {

    //Function to upload the file and pass it to the folder

    var fstream;
    req.pipe(req.busboy);
    req.busboy.on('file', function (fieldname, file, filename) {
      console.log("Uploading: " + filename);

      //Path where image will be uploaded (basename strips any directory parts)
      fstream = fs.createWriteStream(__dirname + '/public/file/' + path.basename(filename));
      file.pipe(fstream);
      fstream.on('close', function () {
        console.log("Upload Finished of " + filename);
        res.redirect('back'); //where to go next
      });
    });
  });

var server = app.listen(3030, function() {
  console.log('Listening on port %d', server.address().port);
});

You can download the working code here on Github :


Neural Networks (ANN) and brainjs


This post is about the fundamentals of getting started with neural networks and building applications with them in Node.js. I have not discussed any self-developed example for now; I will share one very soon.

Briefly, I will introduce ANNs, i.e. Artificial Neural Networks : introductions to the concepts usually build on logic gates in the design, but nowadays we just simulate the logic in our web applications using libraries to get the same results.

What is ANN ?

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This is true of ANNs as well.

Why use ANN ?

Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an “expert” in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer “what if” questions.
Other advantages include:

  1. Adaptive learning: An ability to learn how to do tasks based on the data given for training or initial experience.
  2. Self-Organisation: An ANN can create its own organisation or representation of the information it receives during learning time.
  3. Real Time Operation: ANN computations may be carried out in parallel, and special hardware devices are being designed and manufactured which take advantage of this capability.
  4. Fault Tolerance via Redundant Information Coding: Partial destruction of a network leads to the corresponding degradation of performance. However, some network capabilities may be retained even with major network damage.

Difference between Algo based computing systems and ANN :

Neural networks take a different approach to problem solving than that of conventional computers. Conventional computers use an algorithmic approach i.e. the computer follows a set of instructions in order to solve a problem. Unless the specific steps that the computer needs to follow are known the computer cannot solve the problem. That restricts the problem solving capability of conventional computers to problems that we already understand and know how to solve. But computers would be so much more useful if they could do things that we don’t exactly know how to do.

Neural networks process information in a similar way the human brain does. The network is composed of a large number of highly interconnected processing elements(neurones) working in parallel to solve a specific problem. Neural networks learn by example. They cannot be programmed to perform a specific task. The examples must be selected carefully otherwise useful time is wasted or even worse the network might be functioning incorrectly. The disadvantage is that because the network finds out how to solve the problem by itself, its operation can be unpredictable.

On the other hand, conventional computers use a cognitive approach to problem solving; the way the problem is to be solved must be known and stated in small unambiguous instructions. These instructions are then converted to a high level language program and then into machine code that the computer can understand. These machines are totally predictable; if anything goes wrong, it is due to a software or hardware fault.

Neural networks and conventional algorithmic computers are not in competition but complement each other. There are tasks that are more suited to an algorithmic approach, like arithmetic operations, and tasks that are more suited to neural networks. Even more, a large number of tasks require systems that use a combination of the two approaches (normally a conventional computer is used to supervise the neural network) in order to perform at maximum efficiency.

Architecture of ANN :

They are divided into :

(1) Feed-Forward Networks, also called Bottom-up or Top-down Networks :

  • The signal travels one way, from input to output.
  • There are no feedback loops, i.e. the output of a layer doesn’t affect that same layer.
  • Used extensively in pattern recognition.

(2) Feedback Networks :

  • The signal can flow from output to input as well.
  • They are powerful and can be extremely complicated.
  • Their state is dynamic; it changes continuously until it reaches an equilibrium point.
  • They stay at the equilibrium point until the input changes and a new equilibrium has to be found.

(3) Network Layers

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of “input” units is connected to a layer of “hidden” units, which is connected to a layer of “output” units.

  • The activity of the input units represents the raw information that is fed into the network.
  • The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units.
  • The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

This simple type of network is interesting because the hidden units are free to construct their own representations of the input. The weights between the input and hidden units determine when each hidden unit is active, and so by modifying these weights, a hidden unit can choose what it represents.

(4) Perceptrons

  • They follow the MCP (McCulloch-Pitts) model, i.e. neurons with weighted inputs.
  • Every input has a specific task to perform, following a weight-based model.

How to Train ANN ? 

This is the most important part to think through before applying an ANN to real world web applications. The bigger picture can be divided into :

(1) Supervised Learning 

  • It incorporates an external teacher, so that each output unit is told what its desired response to input signals ought to be.
  • During the learning process global information may be required.
  • Paradigms of supervised learning include error-correction learning, reinforcement learning and stochastic learning.

Constraints :

  • Problem of error convergence, i.e the minimisation of error between the desired and computed unit values. The aim is to determine a set of weights which minimises the error.
  • One well-known method, which is common to many learning paradigms is the least mean square (LMS) convergence.

(2) Unsupervised Learning 

  • It uses no external teacher and is based upon only local information.
  • It is also referred to as self-organisation, in the sense that it self-organises data presented to the network and detects their emergent collective properties.
  • Paradigms of unsupervised learning are Hebbian learning and competitive learning.
  • We say that a neural network learns off-line if the learning phase and the operation phase are distinct. A neural network learns on-line if it learns and operates at the same time. Usually, supervised learning is performed off-line, whereas unsupervised learning is performed on-line.

Other level of learning methods specific to ANN can be  :

(1) Associative Mapping 

  • The network learns to produce a particular pattern on the set of input units whenever another particular pattern is applied on the set of input units.

Divided further into :

  • Auto-association: an input pattern is associated with itself and the states of input and output units coincide. This is used to provide pattern completion, i.e. to produce a pattern whenever a portion of it or a distorted pattern is presented. In the second case, the network actually stores pairs of patterns, building an association between two sets of patterns.

  • Hetero-association: is related to two recall mechanisms:

    Nearest-neighbour recall, where the output pattern produced corresponds to the input pattern stored, which is closest to the pattern presented, and

    Interpolative recall, where the output pattern is a similarity-dependent interpolation of the patterns stored corresponding to the pattern presented. Yet another paradigm, which is a variant of associative mapping, is classification, i.e. when there is a fixed set of categories into which the input patterns are to be classified.

(2) Regularity Detection 

  • The units learn to respond to particular properties of the input patterns. Whereas in associative mapping the network stores the relationships among patterns, in regularity detection the response of each unit has a particular ‘meaning’.
  • This type of learning mechanism is essential for feature discovery and knowledge representation.
  • Every neural network possesses knowledge, which is contained in the values of the connection weights. Modifying the knowledge stored in the network as a function of experience implies a learning rule for changing the values of the weights.

(3) Fixed Networks

  • Weights are not able to change.
  • The values are set in advance according to the problem to solve.

(4) Adaptive Networks 

  • Weights are able to change.

The other important concepts related to learning are :

  • Transfer Function

An ANN depends on weights, inputs and outputs, which are defined per unit. The units are classified as :

  1. Linear units : the output activity is proportional to the total weighted input.
  2. Threshold units : the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value.
  3. Sigmoid units : the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater resemblance to real neurones than do linear or threshold units, but all three must be considered rough approximations.
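The three unit types can be written down as small functions. This is only a sketch: the function names, the slope parameter and the two output levels (0 and 1) of the threshold unit are assumptions made for illustration.

```javascript
// Total weighted input of a unit: sum of input_i * weight_i.
function weightedInput(inputs, weights) {
  return inputs.reduce(function (sum, x, i) { return sum + x * weights[i]; }, 0);
}

// Linear unit: output is proportional to the total weighted input.
function linearUnit(total, slope) { return slope * total; }

// Threshold unit: one of two levels, depending on the threshold.
function thresholdUnit(total, threshold) { return total > threshold ? 1 : 0; }

// Sigmoid unit: output varies continuously, but not linearly, between 0 and 1.
function sigmoidUnit(total) { return 1 / (1 + Math.exp(-total)); }

var total = weightedInput([1, 0], [0.5, 0.25]); // 0.5
console.log(linearUnit(total, 2));       // 1
console.log(thresholdUnit(total, 0.7));  // 0
console.log(sigmoidUnit(0));             // 0.5
```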

Note : We must choose these units correctly. Also check whether a unit of one system can influence another system.

  • Back Propagation Algorithm

In order to train a neural network to perform some task :

  • We must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced.
  • This process requires that the neural network compute the error derivative of the weights (EW).
  • In other words, it must calculate how the error changes as each weight is increased or decreased slightly.
  • The back propagation algorithm is the most widely used method for determining the EW.

The back-propagation algorithm is easiest to understand if all the units in the network are linear. The algorithm computes :

  • Each EW by first computing the EA, the rate at which the error changes as the activity level of a unit is changed.
  • For output units, the EA is simply the difference between the actual and the desired output.
  • To compute the EA for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit and the output units to which it is connected.
  • We then multiply those weights by the EAs of those output units and add the products. This sum equals the EA for the chosen hidden unit.
  • After calculating all the EAs in the hidden layer just before the output layer, we can compute in like fashion the EAs for other layers, moving from layer to layer in a direction opposite to the way activities propagate through the network. This is what gives back propagation its name.
  • Once the EA has been computed for a unit, it is straight forward to compute the EW for each incoming connection of the unit. The EW is the product of the EA and the activity through the incoming connection.
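The steps above can be sketched for a tiny network of linear units with one hidden layer. The function and variable names are hypothetical, and the layout of the weights (one row of incoming weights per unit) is an assumption of this sketch, not something the description prescribes.

```javascript
// Activity of each unit in a layer of linear units:
// weights[j][i] is the weight from input i to unit j.
function forwardLinear(weights, inputs) {
  return weights.map(function (row) {
    return row.reduce(function (sum, w, i) { return sum + w * inputs[i]; }, 0);
  });
}

function backPropagate(input, hiddenWeights, outputWeights, desired) {
  var hidden = forwardLinear(hiddenWeights, input);
  var actual = forwardLinear(outputWeights, hidden);

  // EA of an output unit: difference between actual and desired output
  var outputEA = actual.map(function (y, k) { return y - desired[k]; });

  // EA of hidden unit j: sum over output units of (weight j->k) * (EA of k)
  var hiddenEA = hidden.map(function (unused, j) {
    return outputWeights.reduce(function (sum, row, k) {
      return sum + row[j] * outputEA[k];
    }, 0);
  });

  // EW of a connection: EA of the receiving unit * activity coming through it
  var outputEW = outputEA.map(function (ea) {
    return hidden.map(function (h) { return ea * h; });
  });
  var hiddenEW = hiddenEA.map(function (ea) {
    return input.map(function (x) { return ea * x; });
  });

  return { outputEA: outputEA, hiddenEA: hiddenEA, outputEW: outputEW, hiddenEW: hiddenEW };
}

// Two inputs, one hidden unit, one output unit
var result = backPropagate([1, 2], [[0.5, 0.5]], [[2]], [1]);
console.log(JSON.stringify(result));
```

In this example the hidden activity is 1.5 and the output is 3, so the output EA is 2, the hidden EA is 4, and the EWs are 3 for the output connection and 4 and 8 for the two input connections. A training step would then move each weight a small amount against its EW.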

ANN and NodeJS

We use a specific library – Brain.js that can be found here.


//If you have express-generator installed

$express --css stylus --ejs Appname

$cd Appname

$npm install

$npm install brain

Configuring and Defining the functions

Note : I have tried to explain the code in comments.

//You can define these in app.js or use module.exports to put them in a separate file.

var brain = require('./lib/brain');
// Create the neural network from the library.
var net = new brain.NeuralNetwork();
// Training data for the network (here: XOR).
//This can be dynamic, coming from the web application
// Call train() to train your ANN
net.train([{input: [0, 0], output: [0]},
{input: [0, 1], output: [1]},
{input: [1, 0], output: [1]},
{input: [1, 1], output: [0]}]);
// Run the trained ANN on an input to get its output.
var output = net.run([1, 0]); // [0.987]

How to use the same in a web application :

INPUT : You can define the input from any source. I would recommend :

  1. Take the input as JSON from a process. This will make it dynamic.
  2. Take the input from user on the web GUI.

OUTPUT : You can define the output here as :

  1. Set of values fetched from JSON from a process. This will make it dynamic.
  2. Define them manually as set of values.

TRAIN :  Each training pattern should have an input and an output, both of which can be either an array of numbers from 0 to 1 or a hash of numbers from 0 to 1.


{input: {}, output: {}}

net.train(data, {
errorThresh: 0.005, // error threshold to reach
iterations: 20000, // maximum training iterations
log: true, // console.log() progress periodically
logPeriod: 10, // number of iterations between logging
learningRate: 0.3 // learning rate
});

The Output of train()

{
error: error_value, // training error, e.g. 0.0004 (a number between 0 and 1)
iterations: number_of_iterations // e.g. 206
}


Training online, while the application is running, is very painful computationally. One alternative is to train offline and store the trained network as JSON.

var json = net.toJSON();
net.fromJSON(json); // load the trained state back into a network

var net = new brain.NeuralNetwork({
hiddenLayers: [4],
learningRate: 0.6 // global learning rate, useful when training using streams
});
//Hidden layers can be defined as hiddenLayers: [X, Y], where X and Y are the sizes of two hidden layers.

All good to go! Thanks for reading.


Setting Apache Hadoop on Nitrous.IO


Once you have set up any language box (I chose to set up a Node box), proceed to run the following :

$cd workspace

$ wget

$ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$cat ~/.ssh/ >> ~/.ssh/authorized_keys

$chmod 600 ~/.ssh/authorized_keys

$vim hadoop-1.2.1/config/

//Here we set the JVM path. Check that Java is installed using $java -version; if the path is not set, Hadoop gives an error saying JAVA_HOME is not set.

// export JAVA_HOME=/usr/lib/jvm/java-7-oracle

$ bin/hadoop namenode -format

$ bin/hadoop fs -mkdir input

$ bin/hadoop fs -put conf input

$ bin/hadoop fs -cp conf/*.xml input

$ bin

$bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'

$bin/hadoop fs -rmr output

$bin/hadoop jar hadoop-examples-1.2.1.jar wordcount input output

$bin/hadoop fs -rmr output


Now you need to configure core-site.xml


also hdfs-site.xml


Lastly mapred-site.xml


That's all! Just open http://localhost:50030 for Hadoop.


Security tips for expressjs


Another good topic, and a concern raised at the last meetup, is the security of Express/Node applications.

You can download an Express-ready skeleton/seed that has all of this configuration set up for you from the link mentioned below. You can use it to start building your application right away.

This post is based on observations that I collected from various sources on the internet. I have also added a conclusion based on that collection and analysis. So let's get started.

Step 1 : Follow best practices to actually solve most security issues

  • No root please : This is already taken care of for you. Hey wait! What does it actually mean? Ports below 1024, such as 80 and 443, are privileged ports and binding to them requires root access. But why would you use them? You don't have to, because the default is already set to 3000. You can also use 8080 or any other port above 1024. You can read this awesome Stack Overflow answer that explains why ports below 1024 are privileged.

OK. Suppose you do have to listen on a privileged port (0-1023). You can call the Node functions process.setgid() and process.setuid() once the server is listening on the port in app.js. This switches the process to a specific group ID and user ID that have lower privileges than root.

http.createServer(app).listen(app.get('port'), function(){
  console.log("Express server listening on port " + app.get('port'));
});
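
The privilege drop described above could look roughly like the sketch below. dropPrivileges is a hypothetical helper, and the 'www-data' user and group are assumptions; the stub object only exists so the example runs without actually being root.

```javascript
// Drop root privileges once the privileged port is bound.
// `proc` is normally the global `process`; it is a parameter so the logic
// can be exercised without actually running as root.
function dropPrivileges(proc, user, group) {
  if (typeof proc.getuid !== 'function' || proc.getuid() !== 0) {
    return false; // not running as root (or not on POSIX): nothing to drop
  }
  proc.setgid(group); // group first: after setuid() we may no longer be allowed to
  proc.setuid(user);
  return true;
}

// Stub standing in for a root process (assumption, for illustration only).
const calls = [];
const fakeRoot = {
  getuid: function () { return 0; },
  setgid: function (g) { calls.push('setgid:' + g); },
  setuid: function (u) { calls.push('setuid:' + u); }
};

console.log(dropPrivileges(fakeRoot, 'www-data', 'www-data'), calls);
```

In a real app you would call dropPrivileges(process, ...) inside the listen callback, so the port is already bound when root is given up.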
  • Use HTTPS when dealing with user sessions : Remember my presentation where I talked about using connect-mongo to save the session in MongoDB. Make sure you set both secure and httpOnly to true on the session cookie. With secure set to true the cookie is only sent over HTTPS (SSL), and with httpOnly set to true it cannot be read by client-side scripts.
secret: "notagoodsecretnoreallydontusethisone",
cookie: {httpOnly: true, secure: true},
  • Use Helmet for security headers : It bundles middleware that implements various security headers to protect your app in different ways. To learn about the security headers that make a difference, check here.
  1. csp (Content Security Policy)
  2. hsts (HTTP Strict Transport Security)
  3. xframe (X-Frame-Options)
  4. iexss (X-XSS-Protection for IE8+)
  5. ienoopen (X-Download-Options for IE8+)
  6. contentTypeOptions (X-Content-Type-Options)
  7. cacheControl (Cache-Control)
  8. crossdomain (crossdomain.xml)
  9. hidePoweredBy (remove X-Powered-By)

You should implement them as part of app.configure in app.js. I will talk about how the various security headers work in a later post.
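
To get a feel for what such middleware does, here is a hand-rolled sketch that sets a few of the headers from the list. It is not Helmet's implementation; the header values (for example the HSTS max-age) are assumptions, and the response stub is only there so the example runs without a server.

```javascript
// Connect/Express-style middleware that sets a few of the headers listed above by hand.
function securityHeaders(req, res, next) {
  res.setHeader('X-Frame-Options', 'DENY');            // xframe
  res.setHeader('X-Content-Type-Options', 'nosniff');  // contentTypeOptions
  res.setHeader('X-Download-Options', 'noopen');       // ienoopen
  res.setHeader('Strict-Transport-Security', 'max-age=15552000; includeSubDomains'); // hsts
  res.removeHeader('X-Powered-By');                    // hidePoweredBy
  next();
}

// Minimal response stub (assumption, stands in for the real response object).
function createFakeResponse() {
  const headers = { 'x-powered-by': 'Express' };
  return {
    headers: headers,
    setHeader: function (name, value) { headers[name.toLowerCase()] = value; },
    removeHeader: function (name) { delete headers[name.toLowerCase()]; }
  };
}

const res = createFakeResponse();
let nextCalled = false;
securityHeaders({}, res, function () { nextCalled = true; });
console.log(res.headers);
```

A function like this would be registered with app.use() before the routes, so every response carries the headers.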

Express also has built-in middleware that helps protect you from CSRF. It is not enabled by default, but you can turn it on when you want the extra protection. Jokes apart, the code is as simple as it sounds. We use “csrftoken” to expose a specific token to every template. Also check this very interesting post that explains how Facebook solves the CSRF/XSRF issue on its end.

app.use(function (req, res, next) {
  res.locals.csrftoken = req.session._csrf;
  next();
});
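
If you are curious what happens behind such a token, the sketch below issues a random token per session and checks the value that a form sends back. It is only an illustration of the idea, not the Express middleware itself; issueCsrfToken and isValidCsrfToken are hypothetical names.

```javascript
const crypto = require('crypto');

// Create a random token and remember it in the session object.
function issueCsrfToken(session) {
  session.csrfToken = crypto.randomBytes(32).toString('hex');
  return session.csrfToken;
}

// Compare the submitted token with the session's token in constant time.
function isValidCsrfToken(session, submitted) {
  if (typeof submitted !== 'string' || typeof session.csrfToken !== 'string') {
    return false;
  }
  const expected = Buffer.from(session.csrfToken);
  const actual = Buffer.from(submitted);
  // timingSafeEqual throws on different lengths, so check that first.
  return expected.length === actual.length && crypto.timingSafeEqual(expected, actual);
}

const session = {};                                // stand-in for req.session
const token = issueCsrfToken(session);
console.log(isValidCsrfToken(session, token));     // the form echoed the token back
console.log(isValidCsrfToken(session, 'forged'));  // a value that was never issued
```

The token is rendered into each template as a hidden field, and a request that changes state is rejected when the check fails.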
  • Do you use the default error handlers : Yes, these are created by default from Express v4 onwards. You do have to configure them yourself if you serve a plain index.html rather than ejs/jade templates, but it's not that tough.

Step 2 : Define your strategy for HTTP API done with Express :

Yes, you got it right if you are thinking as a back-end developer. If you don't use these strategies then the correct phrase would be “Shit just got real”. All the data objects stored in your data store can easily be read or, worse, modified via the HTTP API that you implemented so beautifully.

  1. Use middleware that does authorization for you : Create a function that decides whether a request is authorized or unauthorized. Check “express-authorization” (or any other package that meets your needs) and write two functions, access() and checkAuthorization().
  2. Apply this function globally with app.use(), so that every REST resource you define for the API is protected and only the guest endpoints are left open.
  3. Define guest endpoints.

//Define in app.js or server.js

var authorize = require('express-authorization');

function access(req, res, next) {
  checkAuthorization(req, function (err, authorized) {
    if (err || !authorized) {
      res.send({message: 'Unauthorized', status: 401});
      return; // stop here, do not run the protected route
    }
    next();
  });
}

function checkAuthorization(req, callback) {
  // Implement this as per the express-authorization API parameters and, of course, as per your application.
  // Call callback(err, authorized) with the result.
}

//Define this in routes.js

function peopleApi(app) {
  // register the REST routes for the API on app here
}

module.exports = peopleApi;

//Setting up Guest endpoints

function guest(req, res, next) {
  req.guestAccess = true;
  next();
}

authorize.guest, // no authentication required!

// Define ApplyAuthentication function to app.js or server.js

applyAuthentication(app, ['/api']); // apply authentication here

//Define the specific authentication anywhere

var _ = require('underscore');
var middleware = require('../middleware');

function applyAuthentication(app, routesToSecure) {
  for (var verb in app.routes) {
    var routes = app.routes[verb];
    routes.forEach(patchRoute);
  }

  function patchRoute (route) {
    // does this route live under one of the secured prefixes?
    var apply = _.any(routesToSecure, function (r) {
      return route.path.indexOf(r) === 0;
    });

    // routes that include the guest() middleware stay open
    var guestAccess = _.any(route.callbacks, function (r) {
      return r.name === 'guest';
    });

    if (apply && !guestAccess) {
      route.callbacks.splice(0, 0, middleware.access.authenticatedAccess());
    }
  }
}

module.exports = applyAuthentication;

Step 3 : Don’t use body-parser()

Source : Here

  • If you go through the post, you will read that using bodyParser() increases the number of temporary files on the server. The obvious question is how that is even a security concern.
  • I use some cloud providers that give me limited disk space. At the rate at which bodyParser() generates temp files, the disk would fill up and my server process would shut down until extra space is configured. A halt in service leaves poor customer feedback.
  • The solution, as mentioned in the post, is to clean up the temp files.
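
One way to clean up the temp files is a small periodic job like the sketch below. cleanOldFiles is a hypothetical helper, the one-hour limit is an assumption, and the demo works in a throwaway directory rather than the real upload folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Delete regular files in `dir` whose last modification is older than `maxAgeMs`.
// Returns the names of the files that were removed.
function cleanOldFiles(dir, maxAgeMs, now) {
  const cutoff = (now === undefined ? Date.now() : now) - maxAgeMs;
  return fs.readdirSync(dir).filter(function (name) {
    const file = path.join(dir, name);
    const stat = fs.statSync(file);
    if (!stat.isFile() || stat.mtimeMs >= cutoff) return false;
    fs.unlinkSync(file);
    return true;
  });
}

// Demo in a throwaway directory: one stale upload, one fresh upload.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'uploads-'));
fs.writeFileSync(path.join(dir, 'stale.tmp'), 'x');
fs.writeFileSync(path.join(dir, 'fresh.tmp'), 'x');
const twoHoursAgo = new Date(Date.now() - 2 * 60 * 60 * 1000);
fs.utimesSync(path.join(dir, 'stale.tmp'), twoHoursAgo, twoHoursAgo);

const removed = cleanOldFiles(dir, 60 * 60 * 1000); // anything older than one hour
console.log(removed);
```

The synchronous fs calls keep the sketch short. Inside a running server you would rather use the async variants, or run the cleanup as a separate cron-style process, so the event loop is not blocked.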


Redis vs MongoDB


If you want to build real-time applications, you need to write async functions in Node that work with in-memory data stores like Redis. Still, many people are confused about why we even need Redis when we are already using something like MongoDB.

The fundamentals can be broken down into :

  • Data Model :

MongoDB

Document oriented, JSON-like. Each document has a unique key within a collection. Documents are heterogeneous.

Redis

Key-value, values are:

  1. Lists of strings
  2. Sets of strings (collections of non-repeating unsorted elements)
  3. Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
  4. Hashes where keys are strings and values are either strings or integers.

  • Storage

MongoDB

Disk, memory-mapped files, index should fit in RAM.

Redis

Typically in-memory.

  • Querying

MongoDB

By key, by any value in document (indexing possible), Map/Reduce.

Redis

By key.


Both can be used together with good results (Craigslist does this).

MongoDB is interesting for persistent, document oriented, data indexed in various ways. Redis is more interesting for volatile data, or latency sensitive semi-persistent data.

  • Redis can be used for user sessions and MongoDB can be used for user data.
  • Redis can be used for advanced features (low latency, item expiration, queues, pub/sub, atomic blocks, etc …) on top of MongoDB.
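
The "Redis on top of MongoDB" idea can be sketched as a read-through cache with item expiration. The code below uses an in-memory Map as a stand-in for Redis and a plain object as a stand-in for MongoDB, so every name in it (createTtlCache, getUser, store.findById) is hypothetical and the calls are synchronous only for brevity.

```javascript
// In-memory stand-in for Redis: values expire after a TTL.
function createTtlCache(now) {
  const entries = new Map();
  return {
    set: function (key, value, ttlMs) {
      entries.set(key, { value: value, expiresAt: now() + ttlMs });
    },
    get: function (key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (entry.expiresAt <= now()) { entries.delete(key); return undefined; }
      return entry.value;
    }
  };
}

// Read-through lookup: try the cache first, fall back to the slower store.
function getUser(id, cache, store, ttlMs) {
  const cached = cache.get('user:' + id);
  if (cached !== undefined) return cached;
  const user = store.findById(id); // stand-in for a MongoDB query
  if (user !== undefined) cache.set('user:' + id, user, ttlMs);
  return user;
}

// Demo with a controllable clock and a store that counts its queries.
let clock = 0;
const cache = createTtlCache(function () { return clock; });
const store = {
  queries: 0,
  findById: function (id) { this.queries += 1; return { id: id, name: 'Ada' }; }
};

getUser(1, cache, store, 1000); // miss: hits the store
getUser(1, cache, store, 1000); // hit: served from the cache
clock = 2000;                   // the entry has expired by now
getUser(1, cache, store, 1000); // miss again
console.log(store.queries);
```

With real clients both calls would be asynchronous, but the flow stays the same: low-latency reads come from the cache, and the document store remains the source of truth.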

#Please note you should never run a Redis and MongoDB server on the same machine. MongoDB memory is designed to be swapped out, Redis is not. If MongoDB triggers some swapping activity, the performance of Redis will be catastrophic. They should be isolated on different nodes.

On a higher level :

For use-cases:

  • Redis is often used as a caching layer or shared whiteboard for distributed computation.
  • MongoDB is often used as a swap-out replacement for traditional SQL databases.


  • Redis is an in-memory db with disk persistence (the whole db needs to fit in RAM).
  • MongoDB is a disk-backed db which only needs enough RAM for the indexes.

There is some overlap, but it is extremely common to use both. Here’s why:

  • MongoDB can store more data more cheaply.
  • Redis is faster for the entire dataset.
  • MongoDB’s culture is “store it all, figure out access patterns later”
  • Redis’s culture is “carefully consider how you’ll access data, then store”
  • Both have open source tools that depend on them, many of which are used together.

NoSQL in a nutshell


Characteristics :

  1. Non relational
  2. Open-source
  3. Cluster friendly
  4. Schema-less : This changes the game compared with developing an application on a relational database. It improves flexibility, since you can add unstructured data to a NoSQL data store.
  5. No joins

Data-models for NoSQL

NoSQL databases use four different data models (4 chunks):

  • Key/Value data model – You have a key and ask the database to grab the value linked to that key. The database knows nothing about what is inside the “value”. Some stores let you save metadata alongside the value and build indexes on those metadata values. You can think of it as a hash map that is persisted on disk. There is no set schema.
  • Document data model – The data is saved as a document with a complex structure, most commonly JSON. There is no set schema. We can query inside the document structure, and there is an ID for indexing.

# The difference between these two models is a bit hazy. We can call them aggregate-oriented databases, i.e. the data store keeps each aggregate as a whole, without any set schema. In practice, the difference between them doesn't matter that much.

# In a relational DB we spread the aggregate over many tables because there is a set schema; without the schema we cannot add a value to the database. In NoSQL we can save the whole complex structure as one data object. In relational terms, an aggregate such as an Order is made up of many rows (e.g. line items) that together form one whole unit. In NoSQL we just store and move that aggregate as a unit: as the value in a key/value store, or as the document in a document store. Conclusion – we have more flexibility while scaling the application layer.
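
A tiny sketch may help to see the difference between the two aggregate-oriented models. The order object, the Map and the array below are plain in-memory stand-ins chosen for illustration, not real database clients.

```javascript
// One aggregate: an order together with its line items.
const order = {
  id: 'order-42',
  customer: 'Ada',
  items: [
    { sku: 'A1', qty: 2 },
    { sku: 'B7', qty: 1 }
  ]
};

// Key/value model: the store only sees an opaque value behind a key.
const kvStore = new Map();
kvStore.set(order.id, JSON.stringify(order));
const fromKv = JSON.parse(kvStore.get('order-42')); // lookup by key only

// Document model: the store understands the structure, so it can query inside it.
const documents = [order];
function findOrdersContaining(sku) {
  return documents.filter(function (doc) {
    return doc.items.some(function (item) { return item.sku === sku; });
  });
}

console.log(fromKv.customer, findOrdersContaining('B7').length);
```

In both cases the whole order travels as one unit; the document model simply lets the database look inside it.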

  • Column family data model – A row key maps to multiple column keys and column values. This lets you pull more information out of a single data query. It is also schema-less.

# The aggregate-oriented data model is useful if you store and retrieve the same aggregate again and again. It's not very useful if you want to slice and dice the data in different ways (better use a relational database for that).

  • Graph databases – A notable example is Neo4j. They break the data down into many small components (nodes and the connections between them) and handle them very carefully. This is very different from the three aggregate-oriented models. It is also schema-less, and it has an awesome query language.


NoSQL consistency 


# ACID: atomicity, consistency, isolation, and durability.

# ACID is about consistency, and many people don't believe that NoSQL is consistent.

Problem : Suppose you have a single unit of information and, when you have written half of the data, someone else reads it (or vice versa). This would mess things up! We need ACID updates to solve such transactional issues.

Solution : Graph databases do use ACID. Aggregate-oriented databases don't need ACID transactions as much: keep the transaction within the boundary of a single aggregate, i.e. any update to one aggregate is ACID in nature.

Problem : Two users of the same app connect to the front-end to change values in a data store. If they do it at the same time, how would it work? If we allow simultaneous changes to the same piece of information, we will have trouble maintaining consistency.

Solution : In relational databases we have transactions, which are typically queued per user. That solves consistency but is not a solution for all systems. Alternatively we can use an “offline lock”: give each aggregate a version stamp. When user one pushes an update the stamp changes, so when user two finishes, the stale stamp reveals the conflict and it can be resolved. ACID transactions are not the same in NoSQL.
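
The version stamp idea can be sketched in a few lines. The in-memory store below is a stand-in written for this example (createStore and its methods are hypothetical), but the check inside update() is the essence of the offline lock.

```javascript
// Each aggregate carries a version stamp; an update only succeeds if the
// caller read the version that is still current.
function createStore() {
  const docs = new Map();
  return {
    insert: function (id, data) { docs.set(id, { version: 1, data: data }); },
    read: function (id) {
      const doc = docs.get(id);
      return { version: doc.version, data: Object.assign({}, doc.data) };
    },
    update: function (id, expectedVersion, data) {
      const doc = docs.get(id);
      if (doc.version !== expectedVersion) return false; // someone else got there first
      docs.set(id, { version: doc.version + 1, data: data });
      return true;
    }
  };
}

const store = createStore();
store.insert('order-1', { qty: 1 });

const userOne = store.read('order-1'); // both users read version 1
const userTwo = store.read('order-1');

const first = store.update('order-1', userOne.version, { qty: 2 });  // accepted
const second = store.update('order-1', userTwo.version, { qty: 5 }); // rejected: stale version
console.log(first, second, store.read('order-1'));
```

User two then has to re-read the aggregate and decide how to merge or retry, which is where the application's own rules come in.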

Types of Consistency  :

  1. Logical – Applies even with sharding, where the data is broken down and the pieces are put on multiple nodes.
  2. Replication – Replicate the same data object across multiple nodes. This protects you in case of node failure, but now there are several copies of each object to keep consistent.

Problem : Users A and B want to book the same hotel room. They are in different geographic locations. The system has to decide who gets the room. Imagine that the communication between the two nodes (one node in each country) is down. The nodes cannot coordinate, so the booking can be made on both sides, creating confusion and issues in the real world. How do we solve this consistency problem?

Solution : One option is to accept no bookings until the connection is back up; the other is to keep going even though the line is down. With the second option, the resulting inconsistency has to be resolved by business logic. Amazon's Dynamo, for example, was built so that the shopping cart is always available, and the conflicts that follow are handled at the business level. So the way to manage these inconsistencies is business logic.

Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system

RYW (Read-Your-Writes) consistency is achieved when the system guarantees that, once a record has been updated, any attempt to read the record will return the updated value.

A conventional relational DBMS will almost always feature RYW consistency. Some NoSQL systems feature tunable consistency, in which, depending on your settings, RYW consistency may or may not be assured.

The core ideas of RYW consistency, as implemented in various NoSQL systems, are:

  • Let N = the number of copies of each record distributed across nodes of a parallel system.
  • Let W = the number of nodes that must successfully acknowledge a write  for it to be successfully committed. By definition, W <= N.
  • Let R = the number of nodes that must send back the same value of a unit of data for it to be accepted as read by the system. By definition, R <= N.
  • The greater N-R and N-W are, the more node or network failures you can typically tolerate without blocking work.
  • As long as R + W > N, you are assured of RYW consistency.

Example: Let N = 3, W = 2, and R = 2. Suppose you write a record successfully to at least two nodes out of three. Further suppose that you then poll all three of the nodes. Then the only way you can get two values that agree with each other is if at least one of them — and hence both — return the value that was correctly and successfully written to at least two nodes in the first place.
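
The N, W and R rules above are simple enough to check in code. The two functions below are a small sketch written for this post (the names are made up); they only restate the inequalities from the list.

```javascript
// True when every read quorum must overlap every write quorum (R + W > N).
function isReadYourWrites(n, w, r) {
  if (w > n || r > n || w < 1 || r < 1) {
    throw new RangeError('W and R must be between 1 and N');
  }
  return r + w > n;
}

// How many node failures each operation can tolerate without blocking.
function toleratedFailures(n, w, r) {
  return { writes: n - w, reads: n - r };
}

console.log(isReadYourWrites(3, 2, 2), toleratedFailures(3, 2, 2));
console.log(isReadYourWrites(3, 1, 1));
```

With N = 3, W = 2, R = 2 the quorums always overlap in at least one node, while W = R = 1 gives up that guarantee in exchange for tolerating more failures.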

In a conventional parallel DBMS, N = R = W, which is to say N-R = N-W = 0. Thus, a single hardware failure causes data operations to fail too. For some applications — e.g., highly parallel OLTP web apps — that kind of fragility is deemed unacceptable.

On the other hand, if W < N, it is possible to construct edge cases in which two or more consecutive failures cause incorrect data values to actually be returned. So you want to clean up any discrepancies quickly and bring the system back to a consistent state. That is where the idea of eventual consistency comes in, although you definitely can (and in some famous NoSQL implementations actually do) have eventual consistency in a system that is not RYW consistent.

When to use NoSQL ? 

It's a matter of personal judgment, but some main drivers are :

  1. You have large amounts of data and/or the data is unstructured, and you want it to be easy to query and program against.
  2. People want to program easily against natural aggregate data objects.
  3. People use them for agile analytics, as opposed to the data warehousing concept. Most people use graph databases for this.



SentAnalysis-py : Code & Screenshots


Last week was crazy! I have been coding a kernel module for the simulation project and the results were a bit unexpected. Instead, I explored sentiment analysis for a friend whom I was helping with his presentations. It's not that complicated after all, at least at a basic level, and it works as a skill-up exercise.

You can use the code once :

  1. You have an output.json {contains tweets from the streaming API}. Download a sample version here.
  2. You know that the syntax is Python 2.7, so it won't work with 3.X. Also, I am using AFINN and will soon be using WordNet {in a more complicated way}.
  3. You have all the required import libraries, including json and oauth2.
  4. Run it as $python . I have added a few screenshots.


import sys
import json
import re

def hw(sent_file, tweet_file):
    # Build the term -> score lookup from the AFINN file.
    sent_dict = {}
    for line in sent_file:
        line_list = line.split()
        if len(line_list) < 2:
            continue
        # Multi-word terms: everything before the last token is the term.
        term = " ".join(line_list[:-1])
        sent_dict[term] = float(line_list[-1])

    # Score each tweet by summing the scores of its words.
    for line in tweet_file:
        tweet = json.loads(line)
        score = 0
        if 'text' in tweet:
            for word in tweet['text'].split():
                word = re.sub('[^0-9a-zA-Z]+', '', word)
                score += sent_dict.get(word, 0)
        print score

def lines(fp):
    print str(len(fp.readlines()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    hw(sent_file, tweet_file)

if __name__ == '__main__':
    main()

Sentiments of the Tweet

Lastly, the plot can be done using matplotlib but I used Google charts for fast depiction.







