Sunday, 22 July 2012

PERFORMANCE TUNING IN INFORMATICA


Performance tuning is done to improve session performance. To tune a session, performance bottlenecks are identified and eliminated. Bottlenecks should be looked for in the following places, in the given order:

1. Target
2. Source
3. Mapping
4. Session
5. System

General Techniques to identify bottlenecks:

1. Thread statistics should be analysed to identify source, target or transformation bottlenecks. By default, the Integration Service uses 1 reader thread, 1 transformation thread and 1 writer thread to process a session. The thread with the highest busy percentage identifies the bottleneck.

2. Add a Filter transformation after each Source Qualifier. Set the filter condition to false so that no data is passed from the filter. If the time to run the session remains about the same, there is a source bottleneck.

3. Add a Filter transformation before each target definition. Set the filter condition to false so that no data is loaded into the target. If the time taken to run the session is almost the same as before, there is a mapping bottleneck.

 General Techniques to improve session performance:

1. If a target bottleneck is identified, consider dropping key constraints and indexes on the target tables before the session runs. Rebuild those constraints and indexes after the session completes.
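For example, assuming an Oracle target and a hypothetical table and index name, the pre- and post-load SQL could look roughly like this sketch:

-- Before the load: drop the index so the inserts do not have to maintain it
DROP INDEX idx_sales_fact_cust;

-- (the session loads the target table here)

-- After the load: rebuild the index
CREATE INDEX idx_sales_fact_cust ON sales_fact (customer_id);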

2. Use bulk loading in case of insertion of a large amount of data. When bulk loading, the Integration Service bypasses the database log, which speeds up performance. But there is a trade-off here: without writing to the database log, the target database cannot perform rollback.

3. If multiple source tables are joined in one Source Qualifier, optimize the query by using hints and/or adding indexes on the joined columns and on the GROUP BY or ORDER BY columns. Configure the database to run parallel queries to improve performance.
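As a rough illustration (Oracle syntax; the tables, columns and hint values are hypothetical), the Source Qualifier SQL override might be tuned along these lines:

-- Index the column used in the join
CREATE INDEX idx_orders_cust ON orders (customer_id);

-- Hint asking the database to run the query in parallel
SELECT /*+ PARALLEL(o, 4) */
       o.order_id, o.order_amount, c.customer_name
FROM   orders o, customers c
WHERE  o.customer_id = c.customer_id
ORDER BY c.customer_name;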

4. Reduce unnecessary transformations in the mapping and delete unnecessary links between transformations to optimize the mapping.

5. If the session reads from a flat file source, performance can be improved by setting the number of bytes the Integration Service reads per line. By default, the Integration Service reads 1024 bytes per line. If each line in the source file is less than the default setting, decrease the 'line sequential buffer length' in the session properties.

6. Use filters as early as possible in the mapping.

7. Eliminate unnecessary datatype conversions. Try to use integer values in place of other datatypes when performing comparisons using lookup and filter transformations.

8. The Integration Service reads expressions written with operators faster than expressions written with functions. Wherever possible, operators should be used, e.g. use the || operator in place of the CONCAT function.

9. When an Aggregator transformation is used, the following guidelines should be followed:
  • Group by simple columns. If possible, prefer number-type columns for the GROUP BY.
  • Use sorted input. When the sorted input option is used, the Integration Service assumes that all data is sorted by group, so it does not wait for the whole data set to arrive in memory before aggregating (a sample source-side ORDER BY follows this list).
  • Use incremental aggregation if the changes in the source affect less than half the target. In this case, the Integration Service updates the target incrementally rather than processing the entire source and recalculating the same calculations every time you run the session.
  • Filter data before aggregation. This avoids unnecessary aggregation.
  • Avoid unnecessary port connections to reduce the amount of data the Aggregator transformation stores in the data cache.
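As an illustration of the sorted input guideline, a Source Qualifier SQL override along the following lines (the table and columns are hypothetical) delivers the rows already sorted by the group-by column:

SELECT dept_id, salary
FROM   employees
ORDER BY dept_id   -- sorted on the GROUP BY column so the Aggregator can use Sorted Input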
10. When using Joiner Transformation, following things should be considered:
  • In the case of large data sets, configure the Joiner transformation to use sorted input. The Integration Service then improves performance by minimizing disk input/output. Also, designate as the master source the source with fewer duplicate key values. When the Integration Service processes a sorted Joiner transformation, it caches rows for 100 unique keys at a time; if the master source contains many rows with the same key value, the Integration Service needs to cache more rows, and performance can slow down.
  • Designate as the master source the source with fewer rows. During a session run, the Joiner transformation compares each row of the detail source against the master source; the fewer rows in the master, the fewer iterations of the join comparison occur, which speeds up the join process.
  • Perform the join in the database or in the Source Qualifier transformation if possible.
11. When using Lookup Transformation, following things should be considered:
  • If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
  • You can improve performance by indexing the columns in the lookup ORDER BY.
  • By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes '--' as a comment notation after the ORDER BY clause to suppress the ORDER BY clause that the Integration Service generates (see the override sketch after this list).
  • If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance: Equal to (=), Less than (<), Greater than (>), Less than or equal to (<=), Greater than or equal to (>=), Not equal to (!=).
  • Improve session performance by caching small lookup tables.
  • If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
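A minimal sketch of such a lookup SQL override (the lookup table and columns here are hypothetical); the trailing '--' comments out the ORDER BY that the Integration Service would otherwise append:

SELECT item_id, item_name, price
FROM   item_dim
ORDER BY item_id --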
12. If possible, run sessions and workflows concurrently to improve performance.

13. Increase the commit interval. By doing so, the number of times the Integration Service commits data decreases, and performance is gained. However, if the commit interval is too high, the Integration Service may fill the database log file and cause the session to fail.

14. If the session runs with the 'high precision' setting enabled, disabling this property may result in a performance gain.

15. Set the proper error tracing level in the session; e.g. the Verbose Data setting should only be used for debugging mappings and should be avoided in normal runs.

16. Use pipeline partitioning to improve session performance. Increasing the number of partitions or partition points increases the number of threads. If the Integration Service nodes contain enough CPU bandwidth, this approach can significantly improve performance.

Informatica Scenarios 1


Informatica Scenario based Questions and Solutions (Part-1)
Scenario 1:

We have a source table containing 3 columns: Col1, Col2 and Col3. There is only 1 row in the table, as follows:

Col1   Col2   Col3
------------------
a      b      c

There is a target table containing only 1 column, Col. Design a mapping so that the target table contains 3 rows, as follows:

Col
-----
a
b
c

Solution (without using a Normalizer transformation):

Create 3 Expression transformations exp_1, exp_2 and exp_3 with 1 port each. Connect Col1 from the Source Qualifier to the port in exp_1, Col2 to the port in exp_2 and Col3 to the port in exp_3. Make 3 instances of the target. Connect the port from exp_1 to target_1, the port from exp_2 to target_2 and the port from exp_3 to target_3.
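For comparison only (a database-side alternative, not the mapping described above), the same result could be produced in SQL with a UNION ALL, assuming the source table is named SRC:

SELECT Col1 AS Col FROM SRC
UNION ALL
SELECT Col2 FROM SRC
UNION ALL
SELECT Col3 FROM SRC;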

Scenario 2:

There is a source table that contains duplicate rows. Design a mapping to load all the unique rows into one target and all the duplicate rows (only 1 occurrence of each) into another target.

Solution :

Bring all the columns from the Source Qualifier into an Aggregator transformation. Check Group By on the key column. Create a new output port count_col in the Aggregator transformation with the expression COUNT(key_column). Add a Router transformation with 2 groups, Dup and Non-Dup. Set the router conditions count_col > 1 in the Dup group and count_col = 1 in the Non-Dup group. Load these 2 groups into the different targets.
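The Aggregator/Router logic mirrors the following SQL sketch (the table name SRC and column key_column are hypothetical):

-- Non-Dup group: keys that occur exactly once
SELECT key_column, COUNT(*) AS count_col
FROM   SRC
GROUP BY key_column
HAVING COUNT(*) = 1;

-- Dup group: one row per key that occurs more than once
SELECT key_column, COUNT(*) AS count_col
FROM   SRC
GROUP BY key_column
HAVING COUNT(*) > 1;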

Scenario 3:

There is a source table containing 2 columns Col1 and Col2 with data as follows:

Col1   Col2
 a          l
 b         p
 a         m
 a         n
 b         q
 x          y

Design a mapping to load a target table with following values from the above mentioned source:

Col1    Col2
  a        l,m,n
  b       p,q
  x        y

Solution:

Use a Sorter transformation after the Source Qualifier to sort the rows with Col1 as the key. Then build an Expression transformation with the following ports (the order of the ports must be the same as listed):

1. Col1_prev: a variable port. Its expression is the variable val (so it holds the previous row's Col1).
2. Col1: an input/output port coming from the Sorter transformation.
3. Col2: an input port coming from the Sorter transformation.
4. val: a variable port. Its expression is Col1.
5. Concatenated_value: a variable port. Its expression is DECODE(Col1, Col1_prev, Concatenated_value || ',' || Col2, Col2).
6. Concatenated_Final: an output port containing the value of Concatenated_value.
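Put together, the port list above looks like this (port type in parentheses, expression after the colon):

Col1_prev          (variable)      : val
Col1               (input/output)  : from the Sorter
Col2               (input)         : from the Sorter
val                (variable)      : Col1
Concatenated_value (variable)      : DECODE(Col1, Col1_prev, Concatenated_value || ',' || Col2, Col2)
Concatenated_Final (output)        : Concatenated_value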

After the Expression transformation, build an Aggregator transformation. Bring the ports Col1 and Concatenated_Final into the Aggregator and group by Col1. Do not give any expression; this effectively returns the last row of each group.

Connect the ports Col1 and Concatenated_Final from aggregator to the target table.

Scenario 4:

Design an Informatica mapping to load first half records to 1 target while other half records to a separate target.

Solution:

You will have to assign a row number to each record. To achieve this, either use Oracle's pseudo-column ROWNUM in the Source Qualifier query or use the NEXTVAL port of a Sequence Generator. Let's name this column rownumber.
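A minimal sketch of the Source Qualifier SQL override using ROWNUM (assuming an Oracle source; the table and columns are only illustrative):

SELECT ROWNUM AS rownumber, empno, ename, sal
FROM   emp;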

From Source Qualifier, create 2 pipelines:

First Pipeline:
Carry the first port Col1 from the SQ transformation into an Aggregator transformation. Create a new output port "tot_rec" and give it the expression COUNT(Col1). Do not group by any port. This gives the total number of records in the source table. Carry this port tot_rec to an Expression transformation. Add another port DUMMY in the Expression transformation with the default value 1.

Second Pipeline:
From the SQ transformation, carry all the ports (including the additional port rownumber generated by ROWNUM or the Sequence Generator) to an Expression transformation. Add another port DUMMY in the Expression transformation with the default value 1.

Join these 2 pipelines with a Joiner transformation on the common port DUMMY. Carry all the source table ports and the 2 additional ports tot_rec and rownumber to a Router transformation. Add 2 groups in the Router: FIRST_HALF and SECOND_HALF. Give the condition rownumber <= tot_rec/2 in FIRST_HALF and rownumber > tot_rec/2 in SECOND_HALF. Connect the 2 groups to 2 different targets.

Monday, 16 July 2012

Data Warehousing


1. What is the difference between a database and a data warehouse?

The major difference is the historical data, which is maintained in the data warehouse instead of being kept in the real-time OLTP (database) system. We also don't want to slow down the OLTP server by using it for reporting and analysis.

As for why Informatica and other ETL/OLAP tools are required: companies don't want to spend a lot of money hiring full-time database developers to write PL/SQL packages and procedures to store all this information in the ODS. Instead, they often hire an Informatica/ETL developer for a shorter period to build mappings, which are GUI based, easy to maintain and easy to run.


A database is the place where data is stored and managed for fast and efficient access.
A data warehouse is the place where application data is managed for analysis and reporting purposes.

2. What is a derived table in data warehousing?
It is similar to a view; it is mainly used to restrict the data.
3. What are the methodologies of Data Warehousing?

Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used; others include AMM, the World Class methodology and many more.
Most of the time, Ralph Kimball's methodology is used for data warehouse design, with two kinds of schema: star and snowflake.
There are 2 design methodologies: 1) Kimball - build the data marts first, then the EDWH; 2) Inmon - build the EDWH first, then derive the data marts from it.
Put another way, there are mainly 2 models in data warehousing:
1. The Ralph Kimball model
2. The Inmon model
The Kimball model is structured as a denormalized (dimensional) design, while the Inmon model is structured as a normalized design. Depending on its requirements, a company chooses one of these models for its DWH.
Correspondingly, there are two approaches:
1. Top-down approach
2. Bottom-up approach
In the top-down (Inmon) approach, the enterprise data warehouse is loaded first and the individual departments' data marts are then prepared from it. In the bottom-up (Kimball) approach, the departments' data is first gathered, cleansed and transformed into data marts, and the enterprise data warehouse is then loaded from them.
4. What is active data warehousing?
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day, even minute-to-minute, decision making in a way that increases the quality of those customer touches, which encourages customer loyalty and thus secures an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations.
In other words, an active data warehouse is one that every user can access at any time, 24/7. (This should not be confused with an active transformation in Informatica, which is a transformation that can change the number of rows that pass through it.)

A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Another definition is: "A data warehouse is a logical collection of information, gathered from many different operational databases, used to create business intelligence that supports business analysis activities and decision-making tasks; primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP)." Generally, data warehousing is not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real' data warehouse is generally preferred to a virtual DW because stored data has been validated and is set up to provide reliable results to common types of queries used in a business.

Friday, 13 July 2012

Tasks in Informatica


The Workflow Manager contains many types of tasks to help you build workflows and worklets. We can create reusable tasks in the Task Developer.
Types of tasks:
Task Type       Tool where task can be created    Reusable or not
Session         Task Developer                    Yes
Email           Workflow Designer                 Yes
Command         Worklet Designer                  Yes
Event-Raise     Workflow Designer                 No
Event-Wait      Worklet Designer                  No
Timer           -                                 No
Decision        -                                 No
Assignment      -                                 No
Control         -                                 No

SESSION TASK
  • A session is a set of instructions that tells the Power Center Server how and when to move data from sources to targets.
  • To run a session, we must first create a workflow to contain the Session task.
  • We can run as many sessions in a workflow as we need. We can run the Session tasks sequentially or concurrently, depending on our needs.
  • The Power Center Server creates several files and in-memory caches depending on the transformations and options used in the session.
EMAIL TASK
  • The Workflow Manager provides an Email task that allows us to send email during a workflow.
  • It is usually created by the Administrator; we just drag it in and use it in our workflow.
Steps:
1.    In the Task Developer or Workflow Designer, choose Tasks-Create.
2.    Select an Email task and enter a name for the task. Click Create.
3.    Click Done.
4.    Double-click the Email task in the workspace. The Edit Tasks dialog box appears.
5.    Click the Properties tab.
6.    Enter the fully qualified email address of the mail recipient in the Email User Name field.
7.    Enter the subject of the email in the Email Subject field. Or, you can leave this field blank.
8.    Click the Open button in the Email Text field to open the Email Editor.
9.    Click OK twice to save your changes.
Example: To send an email when a session completes:
Steps:
1.    Create a workflow wf_sample_email
2.    Drag any session task to workspace.
3.    Edit Session task and go to Components tab.
4.    See On Success Email Option there and configure it.
5.    In Type select reusable or Non-reusable.
6.    In Value, select the email task to be used.
7.    Click Apply -> Ok.
8.    Validate workflow and Repository -> Save
  • We can also drag the email task and use as per need.
  • We can set the option to send email on success or failure in components tab of a session task.
COMMAND TASK
The Command task allows us to specify one or more shell commands in UNIX or DOS commands in Windows to run during the workflow.
For example, we can specify shell commands in the Command task to delete reject files, copy a file, or archive target files.
Ways of using command task:
1. Standalone Command task: We can use a Command task anywhere in the workflow or worklet to run shell commands.
2. Pre- and post-session shell command: We can call a Command task as the pre- or post-session shell command for a Session task. This is done in COMPONENTS TAB of a session. We can run it in Pre-Session Command or Post Session Success Command or Post Session Failure Command. Select the Value and Type option as we did in Email task.
Example: to copy a file sample.txt from the D drive to the E drive.
Command (in Windows): COPY D:\sample.txt E:\
Steps for creating command task:
1.    In the Task Developer or Workflow Designer, choose Tasks-Create.
2.    Select Command Task for the task type.
3.    Enter a name for the Command task. Click Create. Then click done.
4.    Double-click the Command task. Go to commands tab.
5.    In the Commands tab, click the Add button to add a command.
6.    In the Name field, enter a name for the new command.
7.    In the Command field, click the Edit button to open the Command Editor.
8.    Enter only one command in the Command Editor.
9.    Click OK to close the Command Editor.
10.  Repeat steps 5-9 to add more commands in the task.
11.  Click OK.
Steps to create the workflow using command task:
1.    Create a task using the above steps to copy a file in Task Developer.
2.    Open Workflow Designer. Workflow -> Create -> Give name and click ok.
3.    Start is displayed. Drag session say s_m_Filter_example and command task.
4.    Link Start to Session task and Session to Command Task.
5.    Double-click the link between the Session and the Command task and give the condition in the editor as $S_M_FILTER_EXAMPLE.Status=SUCCEEDED
6.    Workflow -> Validate
7.    Repository -> Save
WORKING WITH EVENT TASKS
We can define events in the workflow to specify the sequence of task execution.
Types of Events:
  • Pre-defined event: A pre-defined event is a file-watch event. This event waits for a specified file to arrive at a given location.
  • User-defined event: A user-defined event is a sequence of tasks in the Workflow. We create events and then raise them as per need.
Steps for creating User Defined Event:
1.    Open any workflow where we want to create an event.
2.    Click Workflow-> Edit -> Events tab.
3.    Click to Add button to add events and give the names as per need.
4.    Click Apply -> Ok. Validate the workflow and Save it.
Types of Events Tasks:
  • EVENT RAISE: Event-Raise task represents a user-defined event. We use this task to raise a user defined event.
  • EVENT WAIT: Event-Wait task waits for a file watcher event or user defined event to occur before executing the next session in the workflow.
Example1: Use an event wait task and make sure that session s_filter_example runs when abc.txt file is present in D:\FILES folder.
Steps for creating workflow:
1.    Workflow -> Create -> Give name wf_event_wait_file_watch -> Click ok.
2.    Task -> Create -> Select Event Wait. Give name. Click create and done.
3.    Link Start to Event Wait task.
4.    Drag s_filter_example to workspace and link it to event wait task.
5.    Right click on event wait task and click EDIT -> EVENTS tab.
6.    Select the Pre Defined option there. In the blank space, give the directory and filename to watch, e.g. D:\FILES\abc.txt
7.    Workflow validate and Repository Save.
Example 2: Raise a user defined event when session s_m_filter_example succeeds. Capture this event in event wait task and run session S_M_TOTAL_SAL_EXAMPLE
Steps for creating workflow:
1.    Workflow -> Create -> Give name wf_event_wait_event_raise -> Click ok.
2.    Workflow -> Edit -> Events Tab and add events EVENT1 there.
3.    Drag s_m_filter_example and link it to START task.
4.    Click Tasks -> Create -> Select EVENT RAISE from the list. Give the name ER_Example. Click Create and then Done.
5.    Link ER_Example to s_m_filter_example.
6.    Right click ER_Example -> EDIT -> Properties Tab -> Open Value for User Defined Event and Select EVENT1 from the list displayed. Apply -> OK.
7.    Click link between ER_Example and s_m_filter_example and give the condition $S_M_FILTER_EXAMPLE.Status=SUCCEEDED
8.    Click Tasks -> Create -> Select EVENT WAIT from list. Give name EW_WAIT. Click Create and then done.
9.    Link EW_WAIT to START task.
10.  Right click EW_WAIT -> EDIT-> EVENTS tab.
11.  Select User Defined there. Select the Event1 by clicking Browse Events button.
12.  Apply -> OK.
13.  Drag S_M_TOTAL_SAL_EXAMPLE and link it to EW_WAIT.
14.  Mapping -> Validate
15.  Repository -> Save.
16.  Run workflow and see.



TIMER TASK
The Timer task allows us to specify the period of time to wait before the Power Center Server runs the next task in the workflow. The Timer task has two types of settings:
  • Absolute time: We specify the exact date and time or we can choose a user-defined workflow variable to specify the exact time. The next task in workflow will run as per the date and time specified.
  • Relative time: We instruct the Power Center Server to wait for a specified period of time after the Timer task, the parent workflow, or the top-level workflow starts.
Example: Run session s_m_filter_example 1 minute after the Timer task starts.
Steps for creating workflow:
1.    Workflow -> Create -> Give name wf_timer_task_example -> Click ok.
2.    Click Tasks -> Create -> Select TIMER from list. Give name TIMER_Example. Click Create and then done.
3.    Link TIMER_Example to START task.
4.    Right click TIMER_Example-> EDIT -> TIMER tab.
5.    Select Relative Time Option and Give 1 min and Select ‘From start time of this task’ Option.
6.    Apply -> OK.
7.    Drag s_m_filter_example and link it to TIMER_Example.
8.    Workflow-> Validate and Repository -> Save.
DECISION TASK
  • The Decision task allows us to enter a condition that determines the execution of the workflow, similar to a link condition.
  • The Decision task has a pre-defined variable called $Decision_task_name.condition that represents the result of the decision condition.
  • The Power Center Server evaluates the condition in the Decision task and sets the pre-defined condition variable to True (1) or False (0).
  • We can specify one decision condition per Decision task.
Example: Command Task should run only if either s_m_filter_example or
S_M_TOTAL_SAL_EXAMPLE succeeds. If any of s_m_filter_example or
S_M_TOTAL_SAL_EXAMPLE fails then S_m_sample_mapping_EMP should run.
Steps for creating workflow:
1.    Workflow -> Create -> Give name wf_decision_task_example -> Click ok.
2.    Drag s_m_filter_example and S_M_TOTAL_SAL_EXAMPLE to workspace and link both of them to START task.
3.    Click Tasks -> Create -> Select DECISION from list. Give name DECISION_Example. Click Create and then done. Link DECISION_Example to both s_m_filter_example and S_M_TOTAL_SAL_EXAMPLE.
4.    Right click DECISION_Example-> EDIT -> GENERAL tab.
5.    Set ‘Treat Input Links As’ to OR. Default is AND. Apply and click OK.
6.    Now edit decision task again and go to PROPERTIES Tab. Open the Expression editor by clicking the VALUE section of Decision Name attribute and enter the following condition: $S_M_FILTER_EXAMPLE.Status = SUCCEEDED OR $S_M_TOTAL_SAL_EXAMPLE.Status = SUCCEEDED
7.    Validate the condition -> Click Apply -> OK.
8.    Drag command task and S_m_sample_mapping_EMP task to workspace and link them to DECISION_Example task.
9.    Double click link between S_m_sample_mapping_EMP & DECISION_Example & give the condition: $DECISION_Example.Condition = 0. Validate & click OK.
10.  Double click link between Command task and DECISION_Example and give the condition: $DECISION_Example.Condition = 1. Validate and click OK.
11.  Workflow Validate and repository Save.
12.  Run workflow and see the result.
CONTROL TASK
  • We can use the Control task to stop, abort, or fail the top-level workflow or the parent workflow based on an input link condition.
  • A parent workflow or worklet is the workflow or worklet that contains the Control task.
  • We give the condition to the link connected to Control Task.
Control Option        Description
Fail Me               Fails the Control task.
Fail Parent           Marks the status of the WF or worklet that contains the Control task as failed.
Stop Parent           Stops the WF or worklet that contains the Control task.
Abort Parent          Aborts the WF or worklet that contains the Control task.
Fail Top-Level WF     Fails the workflow that is running.
Stop Top-Level WF     Stops the workflow that is running.
Abort Top-Level WF    Aborts the workflow that is running.

Example: Drag any 3 sessions and if anyone fails, then Abort the top level workflow.
Steps for creating workflow:
1.    Workflow -> Create -> Give name wf_control_task_example -> Click ok.
2.    Drag any 3 sessions to workspace and link all of them to START task.
3.    Click Tasks -> Create -> Select CONTROL from list. Give name cntr_task.
4.    Click Create and then done.
5.    Link all sessions to the control task cntr_task.
6.    Double-click the link between cntr_task and any session, say s_m_filter_example, and give the condition: $S_M_FILTER_EXAMPLE.Status = FAILED (the control task should fire when a session fails).
7.    Repeat the above step for the remaining 2 sessions.
8.    Right click cntr_task-> EDIT -> GENERAL tab. Set ‘Treat Input Links As’ to OR. Default is AND.
9.    Go to the PROPERTIES tab of cntr_task and select the value 'Abort Top-Level Workflow' for the Control Option. Click Apply and OK.
10.  Workflow Validate and Repository Save.
11.  Run the workflow and see the result.
ASSIGNMENT TASK
  • The Assignment task allows us to assign a value to a user-defined workflow variable.
  • See Workflow variable topic to add user defined variables.
  • To use an Assignment task in the workflow, first create and add the Assignment task to the workflow. Then configure the Assignment task to assign values or expressions to user-defined variables.
  • We cannot assign values to pre-defined workflow variables.
Steps to create Assignment Task:
1.    Open any workflow where we want to use Assignment task.
2.    Edit Workflow and add user defined variables.
3.    Choose Tasks-Create. Select Assignment Task for the task type.
4.    Enter a name for the Assignment task. Click Create. Then click Done.
5.    Double-click the Assignment task to open the Edit Task dialog box.
6.    On the Expressions tab, click Add to add an assignment.
7.    Click the Open button in the User Defined Variables field.
8.    Select the variable for which you want to assign a value. Click OK.
9.    Click the Edit button in the Expression field to open the Expression Editor.
10.  Enter the value or expression you want to assign.
11.  Repeat steps 7-10 to add more variable assignments as necessary.
12.  Click OK.
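For instance, assuming a user-defined workflow variable named $$RUN_COUNT (a hypothetical name), the expression assigned in step 10 could simply be:

$$RUN_COUNT + 1

Each run of the Assignment task would then increment the counter by one.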