The Web is made up of many resources, and a resource can be any item of interest. For example, an online book store may define a book as a resource, and clients may access that resource with this URL: http://www.myEbooksStore.com/books.

Accessing the above URL returns a representation of the resource (e.g., books.html), which results in a state change of the client. The end user's browser now displays more than just the home page of the online book store - a more informative and detailed state than before.
Thus, the client application transfers state with each resource representation. Isn't this the same as browsing a website over the Internet?
The WWW itself behaves like a REST system, and many such services are part of our day-to-day activities, like purchasing something from Amazon.com, using Facebook, and even using GMail. So you are using REST, and you didn't even know it.

REST stands for Representational State Transfer. REST is not a standard but an architecture; however, REST does make use of standards such as HTTP, URLs, XML and HTML.

Consider the case of myEbooksStore.com, which enables its customers to:
1. get a list of books
2. get detailed information about a book
3. purchase books online



Get List of Books :
-----------------------
http://www.myEbooksStore.com/books

Note that "how" the web service generates the books list is completely transparent to the client. All that the client knows is, if he/she submits the above URL then a document containing the list of books is returned which is obviously displayed in the browser. Since the implementation is transparent to clients, myEbooksStore.com owner is free to modify the underlying implementation of this resource without impacting clients.So we can consider REST as a loosely coupled architecture.

Here's the document that the client receives:

    <?xml version="1.0"?>
    <p:Books xmlns:p="http://www.myEbooksStore.com"
             xmlns:xlink="http://www.w3.org/1999/xlink">
          <Book id="0120" xlink:href="http://www.myEbooksStore.com/books/0120"/>
          <Book id="0121" xlink:href="http://www.myEbooksStore.com/books/0121"/>
          <Book id="0122" xlink:href="http://www.myEbooksStore.com/books/0122"/>
          <Book id="0123" xlink:href="http://www.myEbooksStore.com/books/0123"/>
    </p:Books>
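
In practice, a client could fetch this representation with a plain HTTP GET, for example using curl (a minimal sketch; the host name is the hypothetical one used in this example):

# retrieve the list of books (returns the XML document shown above)
curl http://www.myEbooksStore.com/books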

Note that the books list has links to get detailed information about each book. This is a key feature of REST: the client transfers from one state to the next by examining and choosing from among the alternative URLs in the response document. This is something like zooming the view on Google Maps. If you want to see the map of a location in Pune, the view first zooms onto India, then Maharashtra, and then Pune. At the first level we get a list of countries from which we select India, then a list of states in India from which Maharashtra is selected, and finally a list of districts in Maharashtra from which Pune is selected. We can see how the data is refined gradually by taking decisions at each level. Let's get back to our BookStore example.

Get Detailed Information about a Book
-----------------------------------------------------
The web service makes available a URL for each book resource. For example, here's how a client requests book 0122:

http://www.myEbooksStore.com/books/0122

Here's the document that the client receives:

    <?xml version="1.0"?>
    <p:Book xmlns:p="http://www.myEbooksStore.com"  
            xmlns:xlink="http://www.w3.org/1999/xlink">
          <Book-ID>0122</Book-ID>
          <Name>JSON explored</Name>
          <Description>This book explains JSON</Description>
          <Versions xlink:href="http://www.myEbooksStore.com/books/0122/versions"/>
          <UnitCost currency="USD">9.20</UnitCost>
          <Quantity>10</Quantity>
    </p:Book>

Again observe how this data is linked to still more detailed data - the versions of this book may be found by traversing the versions hyperlink. Each response document allows the client to drill down to get more detailed information. That's the whole idea of REST, Representational State Transfer.

In short, let's summarize some important points related to REST/Web Services:

1. Client-Server model, where the client pulls representations.

2. Stateless, meaning the server does not keep client session state, so each request from client to server must contain all the information necessary to understand the request. For example, searching google.com for the word "computer" is sent to the Google server as http://www.google.com/#hl=en&output=search&q=computer - all the required information travels with the request, irrespective of any state held on the server. That matches our experience: before searching, we never worry about the state of Google's servers.

3. Common interface. For example, all Google search queries go through a generic interface (HTTP GET, POST, PUT, DELETE); there is no static page per search. Imagine having a static page like http://www.google.com/computer.html for the search results of the "computer" keyword - a bad idea.

4. Interconnected representations - the representations of any resource are interconnected using URLs, thereby enabling a client to progress from one state to another.

5. Cache to improve network efficiency. For example, once a website is loaded, all the external JavaScript files needed by the site can be cached.

6. Categorizing the resources according to how each resource is used. Clients can just receive a representation of the resource, or also modify it. For the former, make those resources accessible using HTTP GET; for the latter, make them accessible using HTTP POST, PUT, and/or DELETE (see the sketch after this list).

7. The underlying implementation of REST needs to be independent of the URL or the type of REST operation (GET, PUT, POST, DELETE). This means a website can be built using either JSP or ASP without impacting the service being provided to the client. Also, data can be represented in JSON, XML or any other structured format.
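To make points 3 and 6 concrete, here is a minimal sketch (the URLs and payload are hypothetical, continuing the bookstore example) of how the same uniform interface covers both reading and modifying a resource:

# read a representation of book 0122
curl -X GET http://www.myEbooksStore.com/books/0122

# modify the resource, e.g. purchase a copy (hypothetical endpoint and payload)
curl -X POST -d "quantity=1" http://www.myEbooksStore.com/books/0122/purchase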

Note :  APIs built using REST or conforming to REST design/architecture are said to be RESTful.

Understanding REST in a RESTful way


Read more about the POSIX shell on Wikipedia. Below is a set of expressions (with example commands) useful for writing portable shell scripts. Entries without an example are left for you to practice.

Exit Status for the "test" command:
0 - The Expression parameter is true.
1 - The Expression parameter is false or missing.
>1 - An error occurred.

Syntax, description and example (where available) for each expression:

-a File
    True, if the specified file is a symbolic link that points to another file that does exist.
    # touch file1
    # ln -s file1 linktofile1
    # ls -al linktofile1
    # test -a linktofile1
    # echo $?
    0
    # rm -f file1
    # test -a linktofile1
    # echo $?
    1

-b File
    True, if the specified file exists and is a block special file.
    All files in /dev are special files; they represent devices of the computer.
    (http://www.lanana.org/docs/device-list/devices-2.6+.txt)
    # test -b /dev/ram0
    # echo $?
    0

-c File
    True, if the specified file exists and is a character special file.
    # test -c /dev/mem0
    # echo $?
    0

-d File
    True, if the specified file exists and is a directory.
    # mkdir abc
    # test -d abc ; echo $?

-e File
    True, if the specified file exists.
    # touch file1 ; test -e file1 ; echo $?

-f File
    True, if the specified file exists and is an ordinary file.
    # touch file1 ; test -f file1 ; echo $?

-g File
    True, if the specified file exists and its setgid bit is set.
    setgid is similar to setuid, the only difference being that it uses a 2 instead of a 4
    (chmod 2XXX file instead of chmod 4XXX file). Read more: setgid and setuid.
    Negative test:
    # touch file1 ; chmod 4000 file1 ; test -g file1 ; echo $?
    Positive test:
    # touch file1 ; chmod 2000 file1 ; test -g file1 ; echo $?

-h File
    True, if the specified file exists and is a symbolic link.
    # test -h linktofile1 ; echo $?

-k File
    True, if the specified file exists and its sticky bit is set.
    # test -k file1 ; echo $?
    1
    # chmod +t file1
    # test -k file1 ; echo $?
    0

-n String
    True, if the length of the specified string is nonzero.
    # str=
    # test -n "$str" ; echo $?
    1
    # str="abc"
    # test -n "$str" ; echo $?
    0

-o Option
    True, if the specified option is on.

-p File
    True, if the specified file exists and is a FIFO special file or a named pipe.
    (A named pipe can be used to transfer information from one application to another without
    the use of an intermediate temporary file. Two separate processes can access the pipe by
    name - one process can open it as a reader, and the other as a writer.)
    # mkfifo pcclm
    # test -p pcclm ; echo $?

-r File
    True, if the specified file exists and is readable by the current process.
    # touch myfile
    # su <anotheruser>
    # test -r /root/myfile

-s File
    True, if the specified file exists and has a size greater than 0.
    # test -s /root/myfile ; echo $?
    1
    # echo "hello" >>/root/myfile
    # test -s /root/myfile ; echo $?
    0

-t FileDescriptor
    True, if the specified file descriptor number is open and associated with a terminal device.

-u File
    True, if the specified file exists and its setuid bit is set.

-w File
    True, if the specified file exists and the write bit is on. However, the file will not be
    writable on a read-only file system even if this test indicates true.

-x File
    True, if the specified file exists and the execute flag is on. If the specified file exists
    and is a directory, the current process has permission to search in the directory.

-z String
    True, if the length of the specified string is 0.
    # mystr=""
    # test -z "$mystr"

-L File
    True, if the specified file exists and is a symbolic link.

-O File
    True, if the specified file exists and is owned by the effective user ID of this process.

-G File
    True, if the specified file exists and its group matches the effective group ID of this process.

-S File
    True, if the specified file exists and is a socket.

File1 -nt File2
    True, if File1 exists and is newer than File2.

File1 -ot File2
    True, if File1 exists and is older than File2.

File1 -ef File2
    True, if File1 and File2 exist and refer to the same file.

String1 = String2
    True, if String1 is equal to String2.
    # test "a" = "a" ; echo $?

String1 != String2
    True, if String1 is not equal to String2.
    # test "a" != "b" ; echo $?

String = Pattern
    True, if the specified string matches the specified pattern.

String != Pattern
    True, if the specified string does not match the specified pattern.

String1 < String2
    True, if String1 comes before String2 based on the ASCII value of their characters.

String1 > String2
    True, if String1 comes after String2 based on the ASCII value of their characters.

Expression1 -eq Expression2
    True, if Expression1 is equal to Expression2.
    # test 2 -eq 2 ; echo $?

Expression1 -ne Expression2
    True, if Expression1 is not equal to Expression2.

Expression1 -lt Expression2
    True, if Expression1 is less than Expression2.

Expression1 -gt Expression2
    True, if Expression1 is greater than Expression2.

Expression1 -le Expression2
    True, if Expression1 is less than or equal to Expression2.

Expression1 -ge Expression2
    True, if Expression1 is greater than or equal to Expression2.
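
For the entries without examples, here is a small practice script (a sketch; the file names are arbitrary) exercising a few of them:

#!/bin/sh
# practice a few of the expressions listed above
touch old.txt
sleep 1
touch new.txt
chmod u+x new.txt

test new.txt -nt old.txt && echo "new.txt is newer than old.txt"
test old.txt -ot new.txt && echo "old.txt is older than new.txt"
test -x new.txt          && echo "new.txt is executable"
test -w old.txt          && echo "old.txt is writable"

count=2
test "$count" -ge 1 && echo "count is greater than or equal to 1"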

Understanding some conditional expressions for the Korn shell or POSIX shell

Web and enterprise data is growing faster every second - an explosion of data. Most of this data is unstructured, and we need a way to manage it, or rather to extract useful information from it. So how big is Big Data really? 1024 GB = 1 Terabyte, 1024 Terabytes = 1 Petabyte... that is massive data, and Google processes almost 20 Petabytes of data every day. Traditional data processing techniques obviously cannot keep up. We need highly optimized data processing techniques and, yes, thousands of machines to get the work done. MapReduce is the way to go for such processing needs.

Data is everywhere :
- Flickr (3 billion photos)
- YouTube (83M videos, 15 hrs/min)
- Web (10B videos watched / mo.)
- Digital photos (500 billion / year)
- All broadcast (70,000TB / year)
- Yahoo! Webmap (3 trillion links,300TB compressed, 5PB disk)
- Human genome (2-30TB uncomp.)


Is MapReduce a programming model? Or an execution environment? Or a software package? It is all of these, depending on whom you ask.

The MapReduce model derives from the map and reduce functions of a functional programming language like Lisp. In Lisp, a map takes as input a function and one or more sequences of values/lists, and applies the function to each element of the sequence(s). A reduce combines all the elements of a sequence using a binary operation; for example, it can use "+" to add up all the elements in the sequence, or "-" to successively subtract them. In Lisp, mapcar applies the function successively to the list elements in order, producing a new list of values.

(mapcar #'+ '(1 2 3 4) '(10 20 30 40)) => (11 22 33 44)

(reduce '- '(11 22 33 44)) => (- (- (- 11 22) 33) 44) => -88

Here the minus operation is applied successively: 11 - 22 = -11, then -11 - 33 = -44, and then -44 - 44 = -88.

MapReduce is inspired by these Lisp concepts. It was developed for processing large amounts of raw data like crawled documents or web logs. Since this is a massive amount of data (BigData), processing must be distributed across thousands of machines to get results in a reasonable time. This is similar to parallel computing, since the same computations are performed on each CPU, but with a different dataset.

MapReduce is a framework invented by Google in 2004 for running applications (aka jobs) across massive datasets, on huge clusters of machines comprising commodity hardware capable of processing petabytes of data. It implements the Map/Reduce computational paradigm used in functional programming. In simple terms, it is a divide-and-conquer technique where the application data is divided into self-contained units of work, each of which may be executed independently on any node in the cluster - the key to the Map/Reduce programming model.

MapReduce = Distributed Computation (on distributed storage, with scheduling and fault tolerance). In general, Map/Reduce has two basic steps:

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes (computer) in the cluster. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Why Map/Reduce ? Performance implications of processing 100 TB of data :
On 1 Node : scanning @50 MB/second ~ 3 years (with uptime , downtime and failures)
On 1000 Node Cluster : scanning @50 MB/second ~ 1 day (with uptime , downtime and failures)

If you are still struggling to get a grip on MapReduce, here's an excellent, innovative presentation:



Logically,
the Map function accepts a pair of data within one data domain and returns a list of pairs in a different domain:    Map(k1,v1) → list(k2,v2)

This is similar to what Sam did: he used a knife (K1) on some fruits (V1) to produce pieces of each fruit, which is similar to list(K2,V2).

The Reduce function is then applied to V2 to produce a collection of values in the same domain:    Reduce(k2, list(v2)) → list(v3)

This is similar to what Sam did next: he used a mixer (K2) and applied it to list V2 (the cut fruits) to create a mixed fruit juice (V3). Each Reduce call typically produces either one value v3 or an empty return value (one glass of mixed fruit juice, or none).

Let's consider another real example on a smaller set of unstructured data:
we have some comments from customers of Hotels A, B and C. Let's try to find out the most "Awesome" hotel according to the customers.

Hotel Name, Review
“Hotel C”,”Liked it”
“Hotel B”,”Awesome pool!”
“Hotel C”,”Awesome experience”
“Hotel B”,”Awesome restaurants”
“Hotel A”,”Loved it”
“Hotel A”,”Miserable experience”
“Hotel B”,”Boring”

The Map function will create a map of reviews for each hotel. So for Hotel B we would have something like:
"Hotel B": "Awesome pool!", "Awesome restaurants", "Boring"

The map function has already gathered all reviews for Hotel B. Now let the Reduce function do its final job. The Reduce function would have an implementation something like: "if the review contains the word 'awesome', increment the counter for that hotel by 1".

Here's the final output: "Hotel B" tops the list with 2 'awesome' reviews, "Hotel C" follows with 1, and "Hotel A" has 0 'awesome' reviews. The above example uses only a small set of data, but in reality we could have thousands of such review comments for hundreds of hotels.
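
As a rough command-line sketch of the same map and reduce steps (assuming the reviews are saved in a file named reviews.csv in the format shown above):

# "map": emit the hotel name once for every review containing "Awesome"
# "reduce": group identical hotel names and count them
grep -i "awesome" reviews.csv | cut -d',' -f1 | sort | uniq -c | sort -rn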

Below is a view of the Map/Reduce technique on a piece of unstructured data, where we try to find the occurrences/count of the words cow and dog:


Some more realistic examples that can be directly implemented using MapReduce:

Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1> indicating one occurrence of the URL. The reduce function then adds together all values for the same URL and emits a <URL, total count> pair.
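
The URL access frequency example maps neatly onto standard Unix tools (a sketch, assuming a log file access.log with one requested URL per line):

# map: emit each URL; shuffle: sort groups identical URLs together; reduce: count each group
sort access.log | uniq -c | sort -rn | head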

MapReduce - A way to process BigData

A password hash generated using algorithms like MD5, BSD, SHA1 or another default hashing algorithm is said to be a weak hash, since there are known attacks against them. It is important to use a hashing algorithm from the SHA-2 family (SHA-224, SHA-256, SHA-384, SHA-512), for which there are no known practical attacks to date. On a UNIX-based operating system, passwords are hashed and stored in either the /etc/passwd or the /etc/shadow file. If the /etc/shadow file is missing on the system, it can be generated by running the command pwconv, which moves the password hashes from /etc/passwd to /etc/shadow and places the character 'x' as a placeholder in the passwd file - indicating that the password hash is stored in the shadow file.

Linux/Unix systems must employ password hashes using the SHA-2 family of algorithms or FIPS 140-2 approved successors. Use of unapproved algorithms may result in weak password hashes, which are more vulnerable to compromise. Check the /etc/passwd and /etc/shadow files for password hashes. Typically the /etc/passwd file looks like:

The hash will always begin with a short identifier indicating the hashing algorithm. The format of a password hash is "$id$salt$hashed", where $id$ identifies the algorithm used. The table below should help:

Algorithm used             Hashed value starts with
BSDi                       _
MD5                        $1$
Blowfish                   $2$, $2a$, $2x$ or $2y$
NT Hash                    $3$
SHA1                       $4$
SHA2 (256 or 384 bits)     $5$
SHA2 (512 bits)            $6$

Typically the /etc/shadow file looks like:

So the easiest way to find weak password hashes is by analyzing the first few characters of the password field, as shown above. You can use a simple shell script to detect this. Before doing so, you need to know about some special values that are never part of a real password hash. Below are the important character sequences:

"NP" or "!" or null - No password, the account has no password.
"LK" or "*" or "*LK*" - the account is Locked, user will be unable to log-in
"!!" - the password has expired

Our shell script should skip such entries and only report accounts whose hashes actually use weak hashing algorithms. Read the files /etc/passwd and /etc/shadow line by line and use the code below to analyze each hash.

#!/bin/sh
# Report accounts in /etc/passwd whose password hashes do not use SHA-2 ($5$ or $6$).
algoname="SHA-2"
while read line
do
    # take the password field and skip locked / passwordless / placeholder entries
    checkline=`echo "$line" | cut -d':' -f2 | grep -v "NP" | grep -v "LK" | grep "^[0-9a-zA-Z./\$][^\*]"`
    if [ -n "$checkline" ]
    then
        algo=`echo "$checkline" | cut -c 1-3`
        # 'x' means the password hash is stored in the /etc/shadow file
        if [ "$algo" = 'x' ]
        then
            continue
        fi
        if [ "$algo" != '$5$' ] && [ "$algo" != '$6$' ]
        then
            accname=`echo "$line" | cut -d':' -f1`
            echo "User $accname is not using $algoname hashing algorithm."
        fi
    fi
done </etc/passwd


# Repeat the same check for /etc/shadow.
while read line
do
    checkline=`echo "$line" | cut -d':' -f2 | grep -v "NP" | grep -v "LK" | grep "^[0-9a-zA-Z./\$][^\*]"`
    if [ -n "$checkline" ]
    then
        algo=`echo "$checkline" | cut -c 1-3`
        if [ "$algo" != '$5$' ] && [ "$algo" != '$6$' ]
        then
            accname=`echo "$line" | cut -d':' -f1`
            echo "User $accname is not using $algoname hashing algorithm."
        fi
    fi
done </etc/shadow

No output from the above script means all users on the system are using a SHA-2 based password hashing algorithm. You can modify the algorithm identifiers ('$5$' and '$6$') in the script to detect other algorithms as well.

Quickly find out the weak password hashes on a unix box

"Warranty void if this seal is broken", that's a common line on any newly purchased electronic device. So if you open the device, its detected. Similar concept applies to world of "Passwords" on a computer. The characters of the plain text password are mixed in using different algorithms to create a hash which is kind of a signature for a stream of data. Another way to explain this according to wikipedia is, "A hash function is an algorithm that transforms (hashes) an arbitrary set of data elements into a single fixed length value (the hash)". So if the original password is "whoami8996" its hash value (using md5sum command on UNIX) is something like "5265638efece6f38bbdc858a5c396fb0", but if I change even one character from the password , its hash will change. These hashing algorithms are designed in such a way that no two different strings will have the same hash value. On Linux/UNIX operating systems you can look at a user's password hash in the /etc/passwd or /etc/shadow file.

This is how a user's account entry looks in the /etc/passwd file:

piyush:$1$Lq1yUo3c$GF7n.Lwjc0YVhHaYvnawQ1:500:500:piyush:/home/piyush:/bin/bash

Every field is separated by a colon ":" and the second field is the hash value of the user's password. Note: an x character instead of the hash indicates that the encrypted password is stored in the /etc/shadow file. One good thing about the hash is that it has the same length for passwords of different lengths (when the same algorithm is used to create the hashed value). In the above case the MD5 algorithm was used to create the hash. How did I know that? Let's try to analyze the hash: $1$Lq1yUo3c$GF7n.Lwjc0YVhHaYvnawQ1.
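
A quick way to see which identifier each account on a box is using (a sketch; requires root to read /etc/shadow):

# print each account name together with the first characters of its password hash
awk -F: '{ print $1, substr($2, 1, 3) }' /etc/shadow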

Every hash starts with a unique identifier string; the $1$ at the start of the hash above indicates that it was created using the MD5 algorithm. The table below lists the algorithms and their identifier strings:

Algorithm used             Hashed value starts with
BSDi                       _
MD5                        $1$
Blowfish                   $2$, $2a$, $2x$ or $2y$
NT Hash                    $3$
SHA1                       $4$
SHA2 (256 or 384 bits)     $5$
SHA2 (512 bits)            $6$

In the above table, the algorithms are listed roughly in order from weakest (prone to attack, more vulnerable to compromise) to strongest (no known practical attacks, or requiring a very long time to compromise using methods like brute force). Some common password hashing schemes only process the first eight characters of a user's password, which reduces the effective strength of the password. So how do you update the hashing algorithm used?

Detect hashing algorithm used :
#authconfig --test | grep hashing
Sample output: password hashing algorithm is md5
Another way to detect:
egrep "password .* pam_unix.so" /etc/pam.d/system-auth-ac | egrep "sha256" | egrep -v "^[ ]*#"
The above command needs to be altered according to the Linux distribution used. Below are some files to look for:
/etc/pam.d/common-password (Debian)
/etc/default/password OR /etc/default/passwd (SUSE/Novell)
/etc/pam.d/system-auth-ac (Red Hat Enterprise Linux - RHEL)
/etc/security/policy.conf (Oracle Solaris)
/etc/security/login.cfg (IBM AIX)
Update the hashing algorithm (RHEL only, for others try editing the files above):
#authconfig --passalgo=sha512 --update
The above update applies only to users who change their passwords after the setting is updated. Existing users who have not changed their password will continue to use the previously configured hashing algorithm. However, we can force users to change their passwords on next login by setting the password expiry age to 0 days for the required users.
Force user to change password:
#chage -d 0 user-name
OR
#passwd -f user-name

Update password hashing algorithm for Linux / Unix

All databases have logs associated with them which keep a record of changes to the database. Let's consider the IBM DB2 database to understand database logging techniques.

A database that uses archival logging can be backed up online. To reach a specified point in time, you can perform a rollforward recovery; a database that uses archival logging is therefore also called recoverable. There is another type of logging, circular logging, which keeps all restart data in a ring of log files. It starts logging in the first file of the ring, then moves on to the next, and so on, until all the files are full. Circular logging then overwrites and reuses the first log file after the data it contains has been written to the database. This continues for as long as the product is in use, and has the advantage that you never run out of log files.


With circular logging, only full backups of the database are allowed, and only while the database is offline. The database must be offline (inaccessible to users) when a full backup is taken. Now consider the case of a real-time database that needs to be backed up and uses circular logging. For example, a telecommunications company will want to log all details of its users' calls. A call detail record contains details of a telecommunication transaction, such as:

1. Phone number of the calling party
2. Phone number of the called party
3. Call start time and date
4. Call duration
5. Identification of the telephone exchange or equipment writing the record
6. A unique sequence number identifying the record
7. Additional digits on the called number used to route or charge the call
8. Call type (voice, SMS, etc.)
9. Any fault condition encountered
Now suppose a database backup is to be done, for which the database needs to be stopped or taken offline. Users, however, will continue to make and receive calls - as mobile users we are never aware of when our service provider takes a database backup. Hence, if circular logging is used, the backup loses any data that arrives while the backup operation is in progress.



Archive logging is the exact opposite of circular logging. Online, incremental and delta backups are supported only if the database is configured for archive logging. All activities against the database are logged during an online backup. After an online backup completes, the database manager forces the currently active log to be closed and, as a result, archived. This gives the online backup a complete set of archived logs available for recovery. In simple terms:

Archive Logging = Serial Logging ( no overwrite of log files as in circular logging )
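
In DB2, switching a database to archive logging is typically done by pointing the LOGARCHMETH1 configuration parameter at an archive location (a sketch; the database name and path are placeholders, and a full offline backup is normally required afterwards):

# enable archive logging for database SAMPLE, archiving closed logs to disk
db2 update db cfg for SAMPLE using LOGARCHMETH1 DISK:/db2/archivelogs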


When an online backup image is restored, the logs must be rolled forward at least to the point in time at which the backup operation completed. Circular logging can recover data only to the specific point in time at which a full backup was taken; this is also known as version recovery. Archive logging can recover data to any point in time, and is hence also known as full recovery or rollforward recovery.

The advantage of choosing archive logging is that recovery tools can use both archived logs and active logs to restore a database either to the end of the logs or to a specific point in time. The advantage of using circular logging is that you never run out of log files or face storage space issues for logs.
Below is an easy to understand comparison between the two logging methods.

Circular Logging: online backups - No; table space backups - No; recover to any point in time - No; automatic log file management - Yes; performance - high; maintenance - less (compared to archive logging).
Archive Logging: online backups - Yes; table space backups - Yes; recover to any point in time - Yes; automatic log file management - No; performance - reduced compared to circular; maintenance - more (compared to circular logging).

Database logging mechanisms


When you read the word snapshot, the first thing that comes to mind is "a photograph which preserves the best moments of your life". Technically a snapshot is very much the same, with the difference that it preserves the state of some digital resource. In VMware, a disk "snapshot" is a copy of the VM's disk file (.vmdk) captured at a certain point in time. This snapshot preserves the disk's file system and the files stored on it, which can be of any type (including all the operating system files). So if something goes wrong with your virtual machine, you can restore it to a snapshot that was previously working.

One can also create snapshots for different versions/service packs of an OS.
Hence snapshots can also be looked upon as a version-control mechanism at the OS level. So if your computer was shut down abruptly or gets infected by a virus, just revert to a snapshot.

So how do snapshots really work? There is just one thumb rule to VMware's snapshot technology: "Snapshots only store the differences between the current state and the original state". It follows a copy-on-write scheme on every subsequent disk access. Let's try to understand what that means...

Consider that you have a text file with the word "COMPUTING" stored in it.
This file is sparse in nature, which means it spans multiple blocks on the disk. Step 1 below demonstrates this scenario. The black lines indicate the links to the stored data. For demonstration purposes, let's assume that each block on disk holds only one character.


Note: the blocks shown above contain only one character purely for example purposes. In reality the block size could be, say, 1 MB, or a sector on disk.
 
Now when you take a snapshot, another file named Snapshot1.vmdk is created. Once the snapshot exists, any changes made to the virtual disk are not written to the original disk image but to the new (snapshot) disk file. This action is very fast, as there is no need to copy the whole virtual disk image.

Thumb Rule : "While saving changed data blocks in a snapshot, all modified block will be saved first , followed by blocks which were deleted as compared to base disk blocks." As seen in Step 2 , blue block is linked at the end of the snapshot1.vmdk

Let's suppose that you take a snapshot after you have saved the word "COMPUTING" in the file. After the snapshot you modify the file by changing its last two blocks (letters N and G circled above) and clearing the letter I. The new changed word is "COMPUTER", as shown in Step 2. The blue block above is simply an empty block created by deleting the letter I. The blocks in red represent the new snapshot1.vmdk disk, which contains only the changed characters.

Thumb Rule : "While reading any file in current state read only the data accessible by first level links, irrespective of number of snapshots and original data of the file."

Reading the first-level links of the "Current State" disk from Step 2 (green blocks), the word "COMPUTER" is retrieved, and the size of snapshot1.vmdk is 3 blocks (2 filled and one empty). Since it is a differential snapshot file, its size is much less than the original base disk (9 blocks). The snapshot image grows as you change more and more data on your original virtual disk image (which remains untouched from the moment you took the snapshot).

Thumb Rule : "Size of a snapshot will always be less than the base disk, but in worst case it will be exactly the same size if all blocks were to be changed."


As seen in Step 3, we now take another snapshot after saving the word "COMPUTER" in the file. On making more changes after snapshot 2, a similar process is followed to create the snapshot2.vmdk file. The new changed word in the file is "CONTAINER". As a result, neither snapshot 1 nor the base file is written to, but both are still referenced. Snapshot 2 stores only the new changes relative to snapshot 1. If you read the first-level links of the green blocks from top to bottom, the word "CONTAINER" is read, even though only 5 letters are stored in snapshot2.

Below are the list of changes made to the file
--------------------------------------------------------------------------
Step1 - Base Disk  = COMPUTING
Step2 - Snapshot1 = COMPUTER
Step3 - Snapshot2 = CONTAINER

From the above scenarios it is clear that taking snapshots in VMware involves writing only the differences in the files changed since the time of the snapshot, not the complete virtual machine disk. This mechanism is similar to taking a diff and applying a patch in Unix, but in a more sophisticated way that diffs at the binary level with knowledge of how VMFS (the Virtual Machine File System) is structured.
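
As a loose analogy only (this is not how VMFS actually stores deltas), here is the diff/patch idea in plain Unix terms:

# store only the difference between two versions of a file...
cp file.txt file.orig
echo "a small change" >> file.txt
diff -u file.orig file.txt > delta.patch

# ...and reconstruct the new version from the original plus the delta
patch file.orig < delta.patch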

Now we have a clear idea about the copy-on-write protocol: every time a block from the base disk or a previous snapshot is changed, it is copied to the current delta (snapshot) file. This also means that when you revert to a snapshot, only the sectors modified since the snapshot was taken have to be rewritten, which is why a snapshot revert is also super fast.

But the question is: what happens when you revert to an older snapshot? The VM software throws away the contents of snapshot2.vmdk and starts over, reading contents from snapshot1.vmdk. During this, all the blue links in Step 3 are replaced with green links (a strategy similar to deleting a node from a linked list - pure programming stuff). Note: the links to snapshot1 and the base disk are not changed.

There are two more important aspects of Snapshot management : Discarding a snapshot and Merging a snapshot (reverting to a non-immediate parent snapshot).

A KB article from VMware on Snapshot management.
http://www.youtube.com/watch?feature=player_embedded&v=rj8ugLrmU-M

How do Virtual Machine Snapshots work in VMware

Hosts, clusters and resource pools together form the skeleton of any virtualization technology, which the virtualization software (hypervisor) consumes to present a virtual machine to the end user. They are the building blocks of any virtualization platform (e.g., VMware). Just as a normal enterprise application has a front end, business logic and a back end, virtualization applications are structured very similarly.
Now the best part of the virtualization application in the above diagram is that the last layer (hardware) is dynamic, in the sense that resources can be added and removed as per the needs.

A host is nothing but a high-end physical computer providing computing and memory resources. The number of CPUs in a host is fixed; however, there is an option of increasing or decreasing the RAM of a host (up to a maximum limit). Storage or datastores can be added to the host as needed. This concept of a host is the same as that of a physical desktop or laptop, except that a host is much more powerful.

For example: a host with 4 dual-core CPUs, each core running at 3 GHz, and 32 GB of memory will have 24 GHz of computing power (4 CPUs x 2 cores x 3 GHz) and 32 GB of RAM available for running virtual machines on top of the host.

Now how do we scale this? You guessed it: combine multiple hosts together. This is nothing but a cluster. So if a cluster has 4 such hosts, then a total of 24 GHz x 4 = 96 GHz of computational power and 32 GB x 4 = 128 GB of RAM is available for virtualization.

How does clustering help? The hypervisor software now sees the combined underlying hardware as a single entity. This means that if we want to create a virtual machine with 50 GHz of computational power and 50 GB of RAM, creating it on a single host is not possible, but creating it on the cluster is. One can relate a cluster to a solution to the problem of defragmentation. Let's see this:



So what's a resource pool? It is a grouping of resources like CPU and memory so that they can be allocated as per the business needs of a particular department in the company. This means it makes more sense to create a group of resource pools from a cluster than from individual hosts. Resource pools are dynamic, and the resources reserved can be changed, modified or removed at any time. Let's consider the scenario of a software company project with one manager, two developers and three test engineers.

Any resource pool can be partitioned into smaller resource pools at a fine-grained level to further divide and assign resources to different groups or for different purposes.

Obviously more machines and resources will be required by the test team in order to test a piece of software on different operating systems. Comparatively, the developers would require a smaller number of VMs, but each VM should be powerful enough. We can easily load-balance the resources as the team expands or shrinks: if the development team is not utilizing its resources to the peak and the test team needs more CPU/RAM, we can easily adjust the resource pools.

As a result, resources are not wasted when they are not being used to their maximum capacity. Resource pools can be nested, dynamically reconfigured and organized hierarchically.

Individual business units can use their own dedicated infrastructure while still benefiting from the efficiency of resource pooling. Isn't this something like an "à la carte" menu, where we use resources only as per the requirement?

What are Hosts, Clusters and Resource Pools

Virtualization is an abstraction layer that sits between the physical hardware and the operating system. Virtualization is a methodology of dividing the resources of a computer into multiple execution environments by applying one or more concepts such as hardware and software partitioning, time-sharing, emulation, simulation and on-demand utilization.

Virtualization allows multiple virtual machines, with different flavours of operating systems like Windows, Linux and Solaris and their respective software/applications, to run separately, just like running multiple physical machines separately.

What are Virtual Machines ?

The simulation of a physical machine in the form of software is known as a virtual machine. With the help of virtualization software (the hypervisor), the operating system is made to believe that the hardware on which it is running is real and fully owned by it, with no other OS sharing it, while the hypervisor actually provides all the interfacing between the OS and the underlying hardware in a shared, on-demand mode.

A VM has its own set of virtual hardware (e.g., RAM, CPU, NIC, hard disks) upon which an operating system is loaded. The operating system sees a normal set of hardware regardless of the actual underlying physical hardware.
New virtual machines can be created in seconds, without the need for any purchase order or any physical space to worry about. Once a virtual machine is provisioned, the required OS and software can be installed on the VM just like on a physical machine.

Simple meaning of Virtualization and Virtual Machine


Go Green is the buzz after cloud computing, but the real question is whether each one of us can contribute towards this initiative to save our planet. According to a survey, 5 million hectares of forest are cut down every year just to fulfil our everyday need for paper - a need we fulfil in a careless, selfish and wasteful manner. How many times have we asked whether we really need the printout of a document?

Why do we need a printed bill after we eat something at a hotel? We could save paper by using a small digital screen on every table of the hotel - even the black-and-white screens found in old digital diaries would do. This would save paper and ink. Another excellent idea, the MinusOneProject from CHEIL India, is shown below.



Do your bit by reducing the font size by 1 before printing a document. Save paper. Help the forests.

Finally something techie ,
More Forests = Print( doc(font-size--) ) ;

So how will you spread this message? Take a mass pledge in your office, spread Minus One in your school/college/organization, publish an article on Minus One, or JUST SHARE THIS LINK/ARTICLE.

Here's a video on the Minus One project:


Go Green with Minus One

Let's try to understand what cloud computing is really all about, what aspects to consider when discussing the cloud, and what types of cloud solutions are available in the market. Cloud computing is not just working on a laptop in a plane; it is more than that. It is a model to optimize the hardware and software usage of every individual working on a computer.

Cloud computing is a complete paradigm shift from locally owned computers to centrally pooled processing power that can be rented on demand, at any time, for any duration. Cloud computing changes the way organizations view IT. With this technology, one can simply rent the computing power the organization requires, eliminating the need to predict and invest capital in the organization's computing needs.
The cloud is just another dimension of the Internet. Cloud computing is essentially Internet computing - in other words, using the Internet as a computing infrastructure and resource. In cloud computing, the Internet is used to provide services such as file backup, data storage, running software applications, multimedia services, email and file exchange. So in the future, the computer OS could consist of just an Internet browser that can stream anything from office apps to music and videos to software development tools - all in one single browser. This mechanism is known as application streaming or application virtualization.

Let's take a walk on the clouds...
Consider a basic example: your organization uses Microsoft Excel and typically would have a client-server networked environment with the Excel application running off an application server and/or off numerous Microsoft Windows based desktop PCs and laptops - collectively known as fat clients. In a cloud computing environment, your organization invests in thin clients (low-cost, scaled-down desktop PCs and laptops) networked to a server to access a spreadsheet application from the "cloud". As an individual, if you have a broadband connection to the Internet, you can use an inexpensive laptop, e.g. a netbook, to access a spreadsheet application from the "cloud". Furthermore, most cloud computing services for individuals are free.

Cloud computing is a model that enables on-demand network access to a shared pool of configurable computing resources such as networks, servers, storage, applications, and services that can be rapidly provisioned and released with minimal human intervention.

The major components of the cloud span software, platform and infrastructure, with all three collectively serving the clients. Together they allow users to run applications and store data online.
Some of the essential characteristics of cloud computing are: on-demand self-service, broad network access, resource pooling (different physical and virtual resources dynamically assigned and reassigned according to consumer demand) and rapid elasticity (the ability to quickly scale with incoming requests for resources and to scale down when they are released).


Three Models of Cloud Computing :

SaaS - Software as a Service allows users to run existing applications online. Here the consumer does not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

PaaS - Platform as a Service allows users to create their own applications using supplier-specific tools and languages. Again, as with SaaS, the consumer does not manage or control the underlying cloud infrastructure (network, servers, operating systems or storage), but has control over the deployed applications and possibly the application hosting environment configuration.

IaaS - Infrastructure as a Service allows users to run any application of their choice on cloud hardware. This means a dedicated amount of processing power along with storage space is allocated to the user. It is like having a computer (virtual machine) in the cloud, managed by the cloud. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

The diagram below should help you easily understand the difference in control across the three types of cloud offerings: IaaS, PaaS and SaaS.


With Cloud Computing , there is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.

Cloud Computing - A Definition by NIST

If someone asks you, "What is cloud computing?", what would you answer? The funny part is that even if you don't know about cloud computing, you can just answer that your CPU is in the cloud and the monitor is at your desk. Just kidding... it's not that simple, but neither is it complex to explain.

"Stop buying those boxes , cloud is not a box".

So what is cloud computing really about? Why is there so much noise about the CLOUD? Consider that you have purchased a new computer ($500), you boot it, and you want to edit a Word document on it. You soon realize that you don't have an office suite installed on your computer. So what do you do now - buy an official Microsoft Office suite for about $700? No way, just too expensive. If you are a cunning person you would search Google for "Open Office", which costs you nothing and satisfies your need for an office suite, though not with exactly everything Microsoft Word and Excel would provide. So have you solved your problem? Not exactly, because soon you may need another piece of software which may not be so easily available. So let's put this option in "the cloud". The reason for putting this solution in the cloud is simple: we just don't want to spend money!

So how do we go about editing the documents we have? Well, if you have a GMAIL account, you can edit these documents (Excel, Word or PowerPoint) on the web, free of cost, thanks to the Google Documents service. This model is nothing but SaaS - Software as a Service. The office suite is installed at a central location at Google, and people can simply leverage its functionality without having to install the application locally on their computers. Is this cloud computing? The answer is YES. So if you are using Google Docs, you are doing CLOUD computing.


So the cloud can save you a lot of money and time, and reduce your headaches. There's still a lot more you can do with cloud computing beyond an office suite...

Bill Gates once said , "The computer was born to solve problems that did not exist before..."

What else can we do with cloud computing? File storage, online backup, photo sharing, music and video streaming, running business applications and more...
Whether you are already a user of "the cloud" or have just decided to become one, cloud computing is here to stay and change our lives. Try to observe your day-to-day tasks and move them to the cloud to make your life simpler, faster and smarter.

Related Article : Cloud Computing - A Definition by NIST

Cloud Computing - Its Bright and Sunny outside

Cloud Computing explained in simple English with a simple example for a layman without any knowledge of what cloud computing is all about.

Simple video :



Now lets get into the cloud :

Cloud Computing - A Backup of your Memories

PC power management is critical for every IT organization, as it helps minimize environmental impact and save money. The real goal is to use computers on demand, which means they should consume power only when some work is being done by the end user and save energy when not in use. The scenario is just like our daily usage of room lights and fans: use them when you are in the room, turn them off when leaving. Now the question is, how can we do this for computers?

Things to Consider for Desktop Power Management

Steganography is the practice of hiding confidential or sensitive information within something that appears to be nothing out of the usual. Steganography is often confused with cryptography because the two are similar in that both are used to protect important information.

If someone views the object that the information is hidden inside of, he or she will have no clue that there is any hidden information in it. As a result, the person will not even try to extract or decrypt it. That is the beauty of steganography.

Steganography in depth


Steganography is the art and science of hiding information by embedding messages within media such as images. The main purpose of digital steganography is to create a message that defies detection.

There are a number of file formats in which data is redundant or of little importance. Digital steganography exploits this fact, so the hidden message does not cause noticeable changes to the file. It is used in graphics files, HTML, sound files, video and text files, for example, but image files are favoured and the results are referred to as stego-images.

The Art of Hiding Information in the Digital World

Any crime involving a computer or a network is referred to as computer crime - a harmful act committed from or against a computer (cyber crime) or a network (net crime).

Practically, there is no reliable data on the amount of computer crime or the physical/economic loss to victims, mainly because many of these crimes in the digital world of 1s and 0s remain undetected.


Why is it more dangerous than terrestrial crime?
Estimates are that computer crime costs victims in the USA at least US$ 500 million (5x10^8) per year.

What is Computer Crime ?

In addition to the normal software and hardware inventory information of an endpoint, it is better to also collect some other critical information for endpoint analysis, threat detection and security breach detection.

Extended Software Inventory

Users

 Collect information about user accounts on the endpoint.
 Information to collect  :
 
 1. User Name
 2. Domain to which the user is registered to
 3. Password Required or not
 4. Has Password expired or not
 5. Account Disabled or not
 6. User's group and quota details
 7. Status (e.g. account blocked due to bad password attempts)
 8. Last logged-in user
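
On a Linux endpoint, a few of these fields can be gathered with standard commands (a sketch; run as root, and the user name is a placeholder):

# account status, password ageing, group membership and last login for one user
passwd -S someuser      # locked / password set / last change
chage -l someuser       # password ageing and expiry details
groups someuser         # group membership
lastlog -u someuser     # last login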
 

Services


 A service is a long-running executable that performs specific functions
 and is designed not to require user intervention.

 Information to collect :
 
 1. Display Name
 2. Service Name
 3. Path to Service executable
 4. Service type. ( eg. own process or share process )
 5. isStarted
 6. Start Mode (Manual or Automatic)
 7. State (running , paused , stopped)
 8. OwnerUserName (System , administrator)
 9. Service using maximum resources ( CPU , RAM )

CPU Meter 

 Information to collect :
 
 1. CPU Speed 
 2. Idle Time in %
 3. User Time in %
 4. Privileged Time in %
 5. Processor Time in %
 6. Total number of Processes
 7. Processor Queue Length

  Anti-virus Protection 

 Information to collect :
 
 1. Name of Anti-Virus Software Installed
 2. Service Names for the Anti-Virus
 3. Latest Definitions
 4. Last Scan Date
 5. Is Auto-scan enabled
 6. Is Auto-update enabled
 7. Health Status - Healthy , Need Update , Not Running ,
     Not Installed 
  Operating System Info.  


 Information to collect :
 
 1. Full OS Name and Service Pack Level
 2. OS Version Number
 3. OS Type
 4. Product ID
 5. Product Key (Win95, Win98, WinME only)
 6. Installation Date
 7. Uptime (days)
 8. OS Language (Language of the installed OS)
 9. System Language

 
  
Adobe Product Info.  


 Information to collect :
 
 1. Adobe Reader
 2. Adobe Acrobat
 3. Adobe Photoshop
 4. Adobe Photoshop Elements
 5. Adobe Illustrator
 6. Adobe InDesign
 7. Adobe GoLive
 8. Adobe ColdFusion
 9. Adobe Flash Player (IE)
 10. Adobe Flash Player (Mozilla)
 11. Adobe Shockwave Player
 12. Adobe Director

 
 Microsoft Remote Desktop  

 Information to collect :
 
 1. Remote Desktop - Enabled Status
 2. Remote Assistance Offering - Enabled Status
 3. Remote Assistance Offering - Helper Control Level
 4. Remote Assistance Offering - Authorized Assistance Users
     (users or groups who are authorized to offer remote assistance)
 5. ScreenSaver enabled in Remote Desktop Session
 6. Maximum Remote Desktop Connections




Related Article : Collecting Extended Inventory Data

Extended Software Inventory for Endpoints

In addition to the normal software and hardware inventory information of an endpoint, it is better to also collect some other critical information for endpoint analysis, threat detection and security breach detection.

Extended Hardware Inventory


 
       Printers
 Collect information about printers connected to the endpoint.
 Information to collect  :
 
 1. Printer Name
 2. Driver Name and Version
 3. Is Local or Network Printer.
 


  USB Devices  


 Information to collect :
 
 1. Type of USB device. eg : Mass Storage, USB Hub, smart card reader etc. 
 2. Manufacturer and Vendor ID. eg : Lenovo , Samsung
 3. Port Number on which the device is connected.
 4. Serial Number. eg: every pen drive has a unique serial number.
 5. Device Class (reserved , hub etc) and Device address
 6. USB Version (1.1, 2.0 etc) and Host Controller (generally 0 except for USBHub)


PCI Devices  


 Typical PCI cards used in PCs include network cards and sound cards.
 Modems, extra ports such as USB or serial, TV tuner cards and disk controllers
 are also common PCI devices.

 Information to collect :
 
 1. Name of the PCI - Peripheral Component Interconnect device 
 2. Type - Integrated onboard or Expansion slot
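
On a Linux endpoint, much of the USB and PCI information above can be pulled from standard tools (a sketch; output formats vary by distribution):

lsusb -v | head -n 40    # USB devices: vendor, product, class, port, serial (verbose)
lspci                    # PCI devices: network cards, sound cards, controllers, etc.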

  Modems

 Information to collect :
 
 1. Provider Name , Manufacturer
 2. Type - Internal , External
 3. Port Number.  eg. COM3
 4. Port Speed. eg. 115200
 5. Port Settings  eg. 8N1
 6. Inf file name

  Monitor  


 Information to collect :
 
 1. Name
 2. Type - LCD , CRT
 3. Manufacturer and year manufactured
 4. Screen Resolution
 5. Color Depth ( eg. 32bit )
 6. Size in Inches
 
  Keyboard  


 Information to collect :
 
 1. Type - Standard 101,102, PS/2,Natural
 2. Number of Function Keys
 3. Manufacturer
 
Pointing Devices  

 Information to collect :
 
 1. Number of buttons (2 , 3 , with/without scroll)
 2. Model
 3. Manufacturer

Collecting Extended Inventory Data for Endpoints

Collecting software and hardware related information about a particular endpoint (any device on a connected network) is known as inventory management or IT Asset Management (ITAM). Inventory management is a key feature to support endpoint lifecycle management.



Goals of Inventory Management

1. Gain control over assets (all elements of software and hardware) in your business environment.
2. Manage IT costs and return on investments (ROI).
3. Ensure compliance of all endpoints.
4. Risk reduction by detecting lost assets (eg. certain endpoints are not reachable for a long time indicating loss).
5. Enforcing policies on black-listed software to avoid any security breach or loss of confidential information or threat of spreading virus in your IT environment.
6. Keeping hardware and software configuration up-to-date in your environment.

ITAM helps easily track hardware information, installed software packages, and operating system settings for all IT assets in an IT enterprise.

Example to find out the inventory on your Windows Box

From your command prompt run : msinfo32

A comprehensive System Information related UI is launched.



To open System Information in History view, type:
msinfo32 /pch

To create a .txt file in the folder C:\Temp with a name of Test.txt, type:
msinfo32 /report C:\TEMP\Test.txt

To view System Information from a remote computer with a UNC name of BIGSERVER, type:
msinfo32 /computer BIGSERVER

You can find more information at MSINFO32
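
On a Linux box, a comparable quick inventory can be pulled with standard commands (a sketch; command availability and the package manager vary by distribution):

uname -a                 # OS and kernel details
lscpu                    # CPU information
free -h                  # memory
lsblk                    # disks and partitions
dmidecode -t system      # hardware vendor/model (requires root)
rpm -qa | head           # installed packages (use dpkg -l on Debian/Ubuntu)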

A typical inventory management process :




Inventory Management helps an IT enterprise answer some key questions about its assets :

Where is the asset? Location tracking - e.g. where it is across the globe
What is its condition? Condition tracking - e.g. working, non-working
What is its status? Status monitoring - e.g. compliant, over-utilized
Where in my network is it? Network tracking - e.g. behind the firewall, on a public IP, or on a particular subnet


Inventory Management for Endpoints
