I am an Oracle professional with twenty-plus years of experience and a deep knowledge and understanding of various facets of Oracle technology, including core Oracle Database Administration, Performance Tuning, Real Application Clusters, Data Guard and Maximum Availability Architecture. I also work extensively in the Oracle Database Cloud space, and I believe that much of future database work will be on the Cloud.
There are two purposes of this audit:
The primary purpose of the audit is to take a step back and look into the database to see whether anything is grossly misconfigured, whether some parameters have not been optimally set, and whether anything needs to be fixed immediately or requires attention in the short term, so that at the end of it we have the reassurance that the database has been diligently examined for any obvious shortcomings.
The secondary purpose of the audit is to ensure that we make an attempt at following the documented best practices to the extent practicable as well as try to ensure that we are able to leverage the functionalities of Oracle 11g extensively.
This audit does not claim to cover everything inside the database, but attempts to establish a general health and well-being check of the database, as well as point out any problem areas or potential problem areas.
The ORCLPRD database is a two node RAC Cluster with the combined processing power of the two servers providing greater throughput and Oracle RAC scalability than is available from a single server. The same RAC setup exists at both the KOLKATA and DELHI sites.
The ORCLPRD database cluster comprises two interconnected computers or servers that appear as if they are one server to end users and applications. The Oracle RAC option with Oracle Database enables us to cluster Oracle databases. Oracle RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so they operate as a single system.
Basically, the ORCLPRD Oracle Clusterware is a portable cluster management solution that is integrated with the Oracle Database. Oracle Clusterware is a required component for Oracle RAC and provides the infrastructure necessary to run it. Oracle Clusterware also manages resources, such as Virtual Internet Protocol (VIP) addresses, databases, listeners, services, and so on. In addition, Oracle Clusterware enables both non-clustered Oracle databases and Oracle RAC databases to use the Oracle high-availability infrastructure. Oracle Clusterware, together with Oracle Automatic Storage Management (Oracle ASM), comprises the Oracle Grid Infrastructure, which enables us to achieve high availability of the Oracle database. We have used these features extensively to set up the ORCLPRD database.
The Oracle Database with the Oracle Real Application Clusters (RAC) option allows running multiple database instances on different servers in the cluster against a shared set of data files, also known as the database. The database spans multiple hardware systems and yet appears as a single unified database to the application. This enables the utilization of commodity hardware to reduce total cost of ownership and to provide a scalable computing environment that supports various application workloads. Oracle RAC is Oracle’s premier shared disk database clustering technology.
The basic database version and configuration is set out below:
|Name of the Database|ORCLPRD|
|Name of the Instances|ORCLPRD1, ORCLPRD2|
|Environment Setting|Run ". oraenv" and set ORACLE_SID to ORCLPRD1 or ORCLPRD2|
The Operating System version is as follows:
The TPC Real Application Cluster setup is a two node RAC cluster, as shown in the screenshot below:
Oracle Clusterware includes a high availability framework that provides an infrastructure to manage any application. Oracle Clusterware ensures that the applications it manages start when the system starts and monitors the applications to ensure that they are always available. If a process fails then Oracle Clusterware attempts to restart the process using agent programs (agents). Oracle clusterware provides built-in agents so that we can use shell or batch scripts to protect and manage an application. Oracle Clusterware also provides preconfigured agents for some applications (for example for Oracle TimesTen In-Memory Database).
If a node in the cluster fails, then we can program processes that normally run on the failed node to restart on another node. The monitoring frequency, starting, and stopping of the applications and the application dependencies are configurable.
We also notice that the Cluster Time Synchronization daemon is running in OBSERVER mode, which implies that we are using NTP to synchronize time amongst the nodes in the RAC cluster.
See the screenshot below:
We have followed documented clusterware best practices, namely:
Following is the main memory configuration for the ORCLPRD database. The exact same configuration exists on both the KOLKATA and DELHI nodes. All memory configurations have been done assuming a high load on the database. Automatic Memory Management (AMM) is enabled on the database. To do this, we have set a target memory size initialization parameter (MEMORY_TARGET) and a maximum memory size initialization parameter (MEMORY_MAX_TARGET). The total memory that the instance uses remains relatively constant, based on the value of MEMORY_TARGET, and the instance automatically distributes memory between the system global area (SGA) and the instance program global area (instance PGA). As memory requirements change, the instance dynamically redistributes memory between the SGA and instance PGA.
|memory_max_target|50 GB|The maximum memory that will ever be available to the instance.|
|memory_target|50 GB|The memory that is currently available to the instance.|
|sga_target|0|Not set because we are using AMM.|
|pga_aggregate_target|0|Need not be set because we are using AMM.|
|Database Buffer Cache|22 GB|Current size of the buffer cache.|
|result_cache_max_size|128 MB|The size is kept pretty small, as in general we do not plan to use the Result Cache in our setup.|
A system global area (SGA) is a group of shared memory structures that contain data and control information for one Oracle database instance. If multiple users are concurrently connected to the same instance, then the data in the instance’s SGA is shared among the users. Consequently, the SGA is sometimes called the shared global area.
An SGA and Oracle processes constitute an Oracle instance. Oracle automatically allocates memory for an SGA when you start an instance, and the operating system reclaims the memory when you shut down the instance. Each instance has its own SGA.
The SGA is read/write. All users connected to a multiple-process database instance can read information contained within the instance’s SGA, and several processes write to the SGA during execution of Oracle.
The SGA contains the following data structures:
Part of the SGA contains general information about the state of the database and the instance, which the background processes need to access; this is called the fixed SGA. No user data is stored here. The SGA also includes information communicated between processes, such as locking information.
Let’s check the total memory configuration along with the SGA. The total memory allocated for the system is 50GB, but we have not set any value for SGA_TARGET, which implies that we are letting Oracle determine the size of the SGA according to its needs. This is technically called Automatic Memory Management, or AMM:
Note: If we are using AMM or ASMM then it is imperative that /dev/shm is set to a size at least equal to or greater than the MEMORY_TARGET parameter. In our case, this requirement has been fulfilled. While /dev/shm is set to 127G, the MEMORY_TARGET is capped at 50G.
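The /dev/shm sizing requirement above can be expressed as a simple check (a sketch; the 127G and 50G figures are the values observed in this audit, which in practice would come from `df /dev/shm` and the MEMORY_TARGET parameter):

```python
# Validate the AMM prerequisite: /dev/shm must be at least as large
# as MEMORY_TARGET, otherwise the instance will fail to start.
def shm_big_enough(dev_shm_gb: float, memory_target_gb: float) -> bool:
    """AMM requires /dev/shm >= MEMORY_TARGET."""
    return dev_shm_gb >= memory_target_gb

print(shm_big_enough(127, 50))  # True -> requirement fulfilled on ORCLPRD
```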
Recommendation: Where the SGA is greater than 8G, as is the case here, the recommendation is to use HugePages. Ref:
This is something that needs to be seriously considered, as configuring HugePages gives a definite performance increase. Without HugePages, the operating system manages memory in pages of 4 KB each. When it allocates pages to the database System Global Area (SGA), the operating system kernel must continually update its page table with the page lifecycle (dirty, free, mapped to a process, and so on) for each 4 KB page allocated to the SGA.
With HugePages, the operating system page table (virtual memory to physical memory mapping) is smaller, because each page table entry is pointing to pages from 2 MB to 256 MB.
Also, the kernel has fewer pages whose lifecycle must be monitored. For example, if you use HugePages with 64-bit hardware and you want to map 256 MB of memory, you may need only one page table entry (PTE). Without HugePages, mapping the same 256 MB requires (256 x 1024 KB) / 4 KB = 65,536 PTEs.
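The PTE arithmetic above can be verified with a quick calculation:

```python
# Page table entries needed to map 256 MB with 4 KB pages.
KB_PER_MB = 1024
ptes_small = 256 * KB_PER_MB // 4   # 256 MB expressed in 4 KB pages
print(ptes_small)                   # 65536

# With HugePages (e.g. a single 256 MB page on suitable 64-bit
# hardware) the same mapping needs just one PTE.
```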
HugePages provides the following advantages:
Note: To set up HugePages we need to disable AMM (Automatic Memory Management). This constraint is not there if we are using ASMM (Automatic Shared Memory Management), which is automatically enabled if we are setting the size of SGA to any value greater than 0.
As discussed earlier in the report, since Automatic Memory Management is enabled on the system, the parameter PGA_AGGREGATE_TARGET is also set to 0.
Let us check the PGA performance:
The PGA cache hit percentage is 96%, which is pretty decent at the current workload, and there is no PGA-related problem on the database.
Recommendation: There are no additional recommendations regarding the PGA configuration, as it has been optimally set up.
Recommendation: There are no additional recommendations with regard to ASM memory.
Background: When Automatic Storage Management was introduced in Oracle 10gR1, it was simply marketed as the volume manager for the Oracle database. ASM was designed as a purpose-built, host-based volume manager and filesystem integrated with the Oracle database. It is built on the Stripe And Mirror Everything (SAME) methodology, which stripes and mirrors data across as many disks as possible and provides the ability to rebalance the file layout online whenever the physical storage configuration changes.
From the user’s perspective, ASM exposes a small number of disk groups. These disk groups consist of ASM disks and files that are striped across all the disks in a disk group. The disk groups are global in nature and database instances running individually or in clusters have a shared access to the disk groups and the files within them. The ASM instances communicate amongst themselves and form the ASM cluster.
Let us see the current disk group setup and the free space available:
|Disk Group Name||Total Size (GB)||Currently Available Size (GB)|
So we have 4 disk groups with ample free space available. Now, let us look into the size of the disks that make up the diskgroup. The test is to confirm whether we have equi-sized disks in a disk group. As we can see from the screenshot below, all the disks in a particular disk group are equi-sized. This is important both from a stability as well as performance perspective.
Next we verify that the COMPATIBLE.ASM value is set to 11.1 or higher for each disk group; a disk group's attributes are displayed in the V$ASM_ATTRIBUTE view only when this is the case. Attributes that are designated as read-only (Y) can only be set during disk group creation.
Similarly, we check the software compatibility for each disk group. These are also properly configured:
We check the ASM_POWER_LIMIT parameter. This parameter gives us an idea of how fast a rebalance operation can take place after performing (planned) maintenance. Up to Oracle 11.2.0.1, the default value is 1 and the maximum value is 11; from Oracle 11.2.0.2 onwards the maximum value is 1024. We see from the following screenshot that we are using the default rebalancing value.
The number of LUNs (Oracle ASM disks) for each disk group should be at least equal to four times the number of active I/O paths**. For example, if a disk group has two active I/O paths, then a minimum of eight LUNs should be used. The LUNs should be of equal size and performance for each disk group. Refer to "Oracle ASM and Multi-Pathing Technologies" in My Oracle Support Note 294869.1; this needs to be checked with the Storage Administrator.
**An I/O path is a distinct channel or connection between storage presenting LUNs and the server. An active I/O path is an I/O path in which the I/O load on a LUN is multiplexed through multipathing software.
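The LUN-count rule can be captured as a small helper (a sketch of the sizing rule from MOS Note 294869.1):

```python
# Minimum LUNs per ASM disk group = 4 x the number of active I/O paths.
def min_luns(active_io_paths: int) -> int:
    return 4 * active_io_paths

print(min_luns(2))  # 8 -> a disk group with two active paths needs 8 LUNs
```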
It is strongly recommended that all Oracle ASM disks in a disk group have similar storage performance and availability characteristics. In storage configurations with mixed speed drives, such as flash memory and hard disk drives (HDD), I/O performance is constrained by the slowest drive. Ref: https://docs.oracle.com/cd/E11882_01/server.112/e18951/asmprepare.htm#OSTMG02550
First, let us check the tablespaces, their associated datafiles, and the size of such files:
It is noticed that while most of the tablespaces are autoextensible, several of them are not. Additionally, some files added to the tablespaces are randomly sized. This shows that the tablespaces are NOT being optimally managed.
Let us see the free space available in the tablespaces. Space is being shown in GB:
Space is readily available in all the tablespaces, and since most of them are autoextensible, day-to-day space management is largely taken care of on this system; the non-autoextensible tablespaces noted above should, however, be addressed.
There are a couple of recommendations here:
Controlfiles are multiplexed and kept at separate locations, both inside ASM: one copy is in the ORCL_FRA diskgroup, while the other is in the ORCL_REDO diskgroup.
There are 4 redo log groups, each having 2 members, as can be seen from the screenshot below:
And the size of each logfile is 2048 MB, as can be seen below:
Let us take a sample of the number of log switches made in an hour:
At this point we can see the number of switches is just 3 per hour, which is fine.
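The switch rate and log size together give a rough upper bound on the hourly redo volume (a hedged estimate; a log can switch before it is full, so actual redo may be less):

```python
# Rough redo volume per hour from the observed switch rate and log size.
def redo_gb_per_hour(switches_per_hour: int, log_size_mb: int) -> float:
    return switches_per_hour * log_size_mb / 1024

# 3 switches/hour of 2048 MB logs, as observed on ORCLPRD:
print(redo_gb_per_hour(3, 2048))  # 6.0 GB/hour at most
```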
Both controlfile copies are multiplexed within ASM. Consider moving one copy to the Unix filesystem, or keep the current configuration but add a third multiplexed copy of the controlfile on the Unix filesystem.
Redo Logs have been configured appropriately and there are sufficient number of groups that have been created to ensure that we should not ever encounter the dreaded error:
“Checkpoint not complete, cannot allocate new log” message
Below is the Network and SCAN configuration for the KOLKATA site
SCAN name: epedbp1-scan, Network: 1
|Type of IP||IP Address||Remarks|
|SCAN IP 1||10.XX.26.102||Enabled|
|SCAN IP 2||10.XX.26.103||Enabled|
|SCAN IP 3||10.XX.26.104||Enabled|
|Client IP-Node 1||10.XX.5.187||Enabled|
|Client IP-Node 2||10.XX.5.155||Enabled|
Let us check whether the round robin algorithm is working while resolving the SCAN name. The round robin algorithm implies that the SCAN name resolves to a different IP address each time an attempt is made to connect to the database. This is load balancing at the connection level.
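The expected round-robin behaviour can be simulated as follows (illustrative only; the real check is to resolve the SCAN name, e.g. with nslookup, several times in a row and observe the rotation):

```python
# Simulate DNS round-robin resolution over the three SCAN IPs.
from itertools import cycle

scan_ips = ["10.XX.26.102", "10.XX.26.103", "10.XX.26.104"]
resolver = cycle(scan_ips)

# Six successive "lookups" should return each IP in turn, twice over.
answers = [next(resolver) for _ in range(6)]
print(answers)
```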
Additionally we will run a check on how the SCAN listeners are placed on both the nodes. Since there are three SCAN listeners, typically two of them ought to be placed on one node, while the third will be on the remaining node (in the case of a two node RAC configuration). In an ideal situation there should not be any deviation from a configuration of this sort.
Finally, let’s have a look at the VIPs
Thus, network and SCAN have been properly set up on the ORCLPRD database.
The DNS server which resolves the SCAN name should be properly secured in the sense that only authorized personnel should have access to the server. Additionally, there should be a backup of the DNS server, in case there is a requirement for maintenance on the main DNS server, so that there is no interruption of service.
As a best practice the round robin algorithm ought to be enabled on the DNS server, so that connect time load balancing is automatically occurring and the connections are equitably distributed amongst the servers.
Finally, SCAN should be leveraged by using the SCAN name to connect instead of using physical IPs or VIPs. It was noticed that several connections were actually made using VIPs and other IPs; this should be strongly discouraged.
NO CHANGES TO THE NETWORK SHOULD BE MADE WITHOUT TAKING THE CONCERNED DBAs INTO CONFIDENCE, AS THE NETWORK IS A MAJOR COMPONENT OF A PROPERLY FUNCTIONING REAL APPLICATION CLUSTERS DATABASE SOLUTION.
There seems to be no log rotation in place for the alert logs. The alert log has grown pretty big, as can be seen in the screenshot below:
Starting with 11g, the listener logs are written in XML format. Being in XML format, many tools can now read the files unambiguously, since the data is enclosed within meaningful tags. Additionally, the XML-format listener log files are rotated: after reaching a certain threshold size, the file is renamed to log_1.xml and a new log.xml is created, somewhat akin to the archived log concept in the case of redo log files. So we are relieved of the task of deleting old listener logs lest they start filling up the system. See the screenshot below:
However, Oracle still continues to write a version of the listener log in the old format. The old-format log is still called listener.log, but it is created in a different directory: $ADR_BASE/tnslsnr/Hostname/listener/trace. Unfortunately there is no archiving scheme for this file, so it simply keeps growing and has to be trimmed manually.
Similarly, as part of the housekeeping, the tracefiles (*.trc) which are generated on the system need to be trimmed from time to time. There is no automatic mechanism to delete these files on the system.
There is a cron job running which deletes old Listener logs, but
The basic backup configuration scheduled through the Oracle Job Scheduler is as follows:
The following has been set up in cron
The RMAN configuration is set to keep one backup in addition to the current backup, as can be seen from the screenshot below:
The table MESSAGELOG has interval partitioning enabled on it (creating a new partition everyday) and an auto-dropping of older partitions coded on it.
This allows for optimization of space usage and ease of administration, as long as we keep a small number of tables under this scheme. Currently we are keeping about three days' worth of data. We may sometimes see four partitions; this happens when the new partition has already been created but the older partition's scheduled drop time has not yet arrived.
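The transient fourth partition can be illustrated with a small model (a sketch of the retention scheme described above, not the actual partition-maintenance code on ORCLPRD):

```python
# Model daily interval partitions with a three-day retention.
from datetime import date, timedelta

def live_partitions(today: date, retention_days: int = 3,
                    new_partition_created: bool = True) -> list:
    """Partitions present: the retention window, plus (briefly) the
    newly created partition before the oldest one is dropped."""
    days = [today - timedelta(days=d) for d in range(retention_days)]
    if new_partition_created:
        days.insert(0, today + timedelta(days=1))
    return days

# Between new-partition creation and the scheduled drop, four coexist:
print(len(live_partitions(date(2016, 10, 20))))                            # 4
print(len(live_partitions(date(2016, 10, 20), new_partition_created=False)))  # 3
```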
The Unix filesystem is shown below:
/orasw needs to be cleaned up, as the available space on that filesystem is only about 17G and it is already 78% used. There seems to be no automatic deletion script in place, and this could become a major issue.
Alert logs, tracefiles, Listener logs all need to be deleted from time to time. Using the nifty Unix utility Logrotate to do it is a very good way of taking care of unwanted runaway growth on the database.
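Where logrotate cannot be applied to a given directory, a periodic purge along the following lines can serve as a stop-gap (a hypothetical sketch; the directory and retention period are assumptions, not the site's actual policy):

```python
# Housekeeping sketch: delete *.trc trace files older than keep_days.
import os
import time

def purge_old_traces(trace_dir: str, keep_days: int = 7) -> list:
    """Remove .trc files older than the cutoff; return the names removed."""
    cutoff = time.time() - keep_days * 86400
    removed = []
    for name in os.listdir(trace_dir):
        path = os.path.join(trace_dir, name)
        if name.endswith(".trc") and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```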
In the ORCLPRD database, alert logs, tracefiles and listener logs are NOT being managed that way. This needs to be taken care of. There is an observation regarding backups:
IT IS STRONGLY SUGGESTED THAT RESTORE AND RECOVERY ARE BOTH TESTED OUT-BOTH FROM THE ASM LEVEL AS WELL AS FROM THE NET BACKUPS.
Below is a set of generic parameter settings which are considered the most important ones from a RAC database optimization and performance management perspective. This list of parameters is by no means exhaustive, and it is suggested that one refer to the Oracle documentation (specifically to the Performance Tuning Guide and the Administrator's Guide) for further information on parameter settings. The parameter settings need to be first tested on a Lab setup before being implemented in the Production setup.
Instance recovery is the process of recovering the redo thread from the failed instance. Instance recovery is different from crash recovery, which occurs when all instances accessing a database have failed. Crash recovery is the only type of recovery when an instance fails using a single-instance Oracle Database.
When using Oracle RAC, the SMON process in one surviving instance performs instance recovery of the failed instance.
In both Oracle RAC and single-instance environments, checkpointing is the internal mechanism used to bound Mean Time To Recover (MTTR). Checkpointing is the process of writing dirty buffers from the buffer cache to disk. With more aggressive checkpointing, less redo is required for recovery after a failure. The MTTR on the ORCLPRD database is not set. Enable the MTTR Advisory by setting FAST_START_MTTR_TARGET to a value greater than zero.
The recommendation would be to set it to 300.
Please refer to http://www.oracle.com/goto/maa for best practices.
The FAST_START_PARALLEL_ROLLBACK parameter determines how many processes are used for transaction recovery, which is done after redo application. Optimizing transaction recovery is important to ensure an efficient workload after an unplanned failure. If the system is not CPU bound, setting this parameter to HIGH is a best practice. This causes Oracle to use four times the CPU_COUNT (4 X CPU_COUNT) parallel processes for transaction recovery. The default setting for this parameter is LOW, or two times the CPU_COUNT (2 X CPU_COUNT). See the screenshot below:
Recommendation: consider setting the parameter fast_start_parallel_rollback parameter to high using:
ALTER SYSTEM SET FAST_START_PARALLEL_ROLLBACK=HIGH SCOPE=BOTH;
When FAST_START_PARALLEL_ROLLBACK is set to HIGH, a system with a large number of CPUs will spawn a lot of parallel recovery slaves which can substantially increase the IOPS rate. In this case the system should not be challenged for I/O before FAST_START_PARALLEL_ROLLBACK is set to HIGH.
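The recovery-slave counts implied by each setting can be sketched as follows (the CPU_COUNT of 16 is an assumed figure for illustration):

```python
# Parallel recovery slaves spawned per FAST_START_PARALLEL_ROLLBACK value:
# LOW -> 2 x CPU_COUNT (the default), HIGH -> 4 x CPU_COUNT.
def recovery_slaves(cpu_count: int, setting: str = "LOW") -> int:
    factor = {"LOW": 2, "HIGH": 4}[setting.upper()]
    return factor * cpu_count

print(recovery_slaves(16, "LOW"))   # 32
print(recovery_slaves(16, "HIGH"))  # 64 -> hence the higher IOPS demand
```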
The ORCLPRD system is not quite CPU bound as we can see from the following screenshot:
Under most circumstances, Oracle Database automatically detects if asynchronous I/O is available and appropriate for a particular platform and enables asynchronous I/O through the DISK_ASYNCH_IO initialization parameter. However, for optimal performance, it is always a best practice to ensure that asynchronous I/O is actually being used. Query the V$IOSTAT_FILE view to determine whether asynchronous I/O is used. In the case of ORCLPRD:
In our case we are already well below the 64 MB threshold:
However, without knowing the transaction rates, the sizes of the transactions, the number of users, and so on, it is difficult to say whether this in itself would be a problem; to be on the safe side, though, it would be good to be above the 64 MB threshold.
For any systems with 4 GB or more memory, disable Automatic Memory Management by setting MEMORY_TARGET=0 and enable Automatic Shared Memory Management by setting SGA_TARGET.
Let us check this on our system:
Since MEMORY_TARGET is set but SGA_TARGET is not, Automatic Memory Management (AMM) is in use on the system. The best practice suggests using Linux HugePages when the SGA is larger than 8GB, which is the case here; as a first step, therefore, we should consider moving from AMM to Automatic Shared Memory Management (ASMM).
Additionally, the sum of SGA and PGA memory allocations on the database server should always be less than the system's physical memory, and conservatively should be less than 75% of total system memory. However, PGA_AGGREGATE_TARGET is not a hard limit, and for some Data Warehouse or reporting applications the PGA memory can grow to 3 x PGA_AGGREGATE_TARGET.
It is suggested that we monitor PGA memory and host-based memory utilization using Oracle Enterprise Manager, or by querying v$pgastat and operating systems statistics, to get an accurate understanding of memory utilization.
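The 75% guideline above lends itself to a quick sanity check (a sketch; the figures used below are illustrative, not the actual ORCLPRD values):

```python
# Worst-case memory check: SGA plus PGA (allowing for overshoot up to
# ~3x the target on some DW/reporting workloads) should stay under
# 75% of physical memory.
def memory_headroom_ok(sga_gb: float, pga_target_gb: float,
                       physical_gb: float, pga_growth: float = 3) -> bool:
    worst_case = sga_gb + pga_target_gb * pga_growth
    return worst_case < 0.75 * physical_gb

print(memory_headroom_ok(40, 10, 128))  # 40 + 30 = 70 GB < 96 GB -> True
```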
It is strongly recommended to move towards HugePages so that the ASM and database instances can use it for their SGAs. HugePages is a feature integrated into the Linux kernel since release 2.6. It provides an alternative to the 4K page size by offering bigger pages. Using HugePages saves memory resources by decreasing page table overhead while ensuring the memory is never paged to disk, which contributes to faster overall memory performance. In addition, overall node stability benefits from using HugePages.
Ensuring the entire SGA of a database instance is stored in HugePages can be accomplished by setting the init.ora parameter use_large_pages=only. Setting this parameter will ensure that an instance will start only when it can get all of its memory for SGA from HugePages. For this reason the setting use_large_pages=only is recommended for database instances.
For ASM instances leave use_large_pages=true (the default value). This setting still ensures that HugePages are used when available, but also ensures that ASM as part of Grid Infrastructure starts when HugePages are not or insufficiently configured.
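The sizing logic behind Oracle's hugepages_settings.sh script can be sketched as follows (the 48 GB SGA and 2 MB huge page size are assumed figures for illustration; the real script inspects the running shared memory segments):

```python
# Number of huge pages needed to hold an SGA of a given size.
def nr_hugepages(sga_bytes: int, hugepage_kb: int = 2048) -> int:
    page_bytes = hugepage_kb * 1024
    # Round up so the entire SGA fits in huge pages.
    return -(-sga_bytes // page_bytes)

GB = 1024 ** 3
# A 48 GB SGA with 2 MB pages -> set vm.nr_hugepages to at least:
print(nr_hugepages(48 * GB))  # 24576
```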
CURSOR_SHARING=SIMILAR has been deprecated as per My Oracle Support Note 1169017.1. Starting with Oracle 11g, Oracle implemented a more intelligent solution known as Adaptive Cursor Sharing, making CURSOR_SHARING=SIMILAR obsolete.
So unless we have a very strong reason to do otherwise, it is best to stick with CURSOR_SHARING=EXACT.
This parameter setting should be TYPICAL or FULL on all the databases. In our case it is set to TYPICAL, as can be seen from the following screenshot.
DB_BLOCK_CHECKSUM determines whether DBWn and the direct loader will calculate a checksum (a number calculated from all the bytes stored in the block) and store it in the cache header of every data block when writing it to disk. Checksums are verified when a block is read only if this parameter is TYPICAL or FULL and the last write of the block stored a checksum. In FULL mode, Oracle also verifies the checksum before a change application from update/delete statements and recomputes it after the change is applied. In addition, Oracle gives every log block a checksum before writing it to the current log.
If this parameter is set to OFF, DBWn calculates checksums only for the SYSTEM tablespace, but not for user tablespaces.
Checksums allow Oracle to detect corruption caused by underlying disks, storage systems, or I/O systems. If set to FULL, DB_BLOCK_CHECKSUM also catches in-memory corruptions and stops them from making it to the disk. Turning on this feature in TYPICAL mode causes only an additional 1% to 2% overhead. In FULL mode it causes 4% to 5% overhead. Oracle recommends setting DB_BLOCK_CHECKSUM to TYPICAL. For backward compatibility, the use of TRUE (equivalent to TYPICAL) and FALSE (equivalent to OFF) is preserved.
Oracle Enterprise Manager Cloud Control enables us to use the Oracle Enterprise Manager console interface to discover Oracle RAC database and instance targets. We can use Oracle Enterprise Manager to configure Oracle RAC environment alerts. We can also configure special Oracle RAC database tests, such as global cache converts, consistent read requests, and so on.
Oracle Enterprise Manager distinguishes between database- and instance-level alerts in Oracle RAC environments. Alert thresholds for instance-level alerts, such as archive log alerts, can be set at the instance target level. This function enables us to receive alerts for the specific instance if performance exceeds our threshold. We can also configure alerts at the database level, such as setting alerts for tablespaces, to avoid receiving duplicate alerts at each instance. We can define blackouts (which are time periods in which database monitoring is suspended so that maintenance operations do not skew monitoring data or generate needless alerts) for all managed targets of an Oracle RAC database to prevent alerts from occurring while performing maintenance. So overall, we can leverage OEM to provide better manageability and administration of a database, clustered or otherwise.
My Oracle Support provides patch recommendations for the following product lines, if we have a valid Customer Support Identifier Number.
Starting with the October 2013 Critical Patch Update, security fixes for Java SE are released under the normal Critical Patch Update schedule.
A pre-release announcement will be published on the Thursday preceding each Critical Patch Update release. Oracle will issue Security Alerts for vulnerability fixes deemed too critical to wait for distribution in the next Critical Patch Update.
Critical Patch Updates are collections of security fixes for Oracle products. They are available to customers with valid support contracts. They are released on the Tuesday closest to the 17th day of January, April, July and October. The next three dates are:
and the last date was 18 October 2016.
Additionally, we can also check for recommended patches, security updates, and bug fixes in the My Oracle Support page, and apply those from time to time.
We ran the orachk tool on the database. This tool analyzes all the information inside the database and, amongst other things, provides a report card with a score. In general, any score above 85 out of 100 is considered good and a validation that the database is functioning properly. See the screenshot below:
Please find below a summary of recommendations based on the audit of the database. While these recommendations are made after due diligence and consideration, it is advised to test them on a lab setup wherever applicable.