2009/06/01

size sometimes does matter...

Not really. Well... You know what I mean. ;)

Anyways, sorry for the "catchy" title. This entry is prompted by some of the ongoing exercise in fine tuning our AIX db servers.

AIX is a great OS but the information needed to eek out the last ounce of performance for Oracle dbs is sparse, or across many documents. This blog entry is for my future reference on how to make Oracle use large pages in AIX. If it helps you, then even better!


One of the ways modern servers differ from older models is in the use of 64-bit memory and very large physical memory sizes.

However, traditional virtual memory uses 4KB physical page sizes in Unix. These pages have to be managed by the OS memory management code.

This is usually achieved through a mechanism referred to as address translation. In as simple terms as I can put it, the OS needs to know the status of every page making up its physical addressable memory in order to provide the virtual memory we all know and love.

Now, when we are talking about physical memory sizes of hundreds of Gigabytes, it is not hard to comprehend that managing those in terms of "chunks" - pages - of 4K bytes leads to a LOT of overhead.


Think of it this way: how many 4K pages fit into an addressable space of, say, 64GB?
I'll save you the time: in excess of 15 million entries.


It's not hard then to reason that some overhead will result from managing such large tables of memory pages.

This is typically indicated by excess CPU usage being noticed in kernel mode. Anything over 5% is usually a good indication that something may be amiss with memory management. Note that I said: "may be". It's not the only cause, but it's worth checking.

The problem is made worse by the simple fact all this activity is pre-emptive of any other processing. The OS simply cannot make CPU available to user processes until it has finished managing memory! A good indication of this type of event is when a relatively high kernel mode CPU usage is noted, together with unexplained wait time.

The important thing to note here is that very large numbers of pages can pose serious loads to an OS in terms of memory management. Add-in the occasional paging activity and/or the need to keep database buffer structures - such as the Oracle SGA - locked in memory and you have the recipe for serious memory management overhead.

So, what is the solution? We NEED those Gigabytes!

Many have been tried over the years. The most common is to increase the page size from the default of 4K to a much larger number. The bigger the page size, the less of them needed to represent very large physical memory, the less overhead in maintaining and managing said pages.

Simple arithmetic.

I like simple.


With AIX, the common use bigger page size is 16MB. These are commonly called "largepages". Large pages can coexist in memory with smaller ones: you don't need to re-define the entire adressable memory because not every process is a gigantic memory hog like Oracle - nothing wrong with that, it's the way dbs are supposed to operate, DB2 and others do exactly the same.

The number of largepages available in a system is a tunable parameter. In a nutshell, one reserves a large chunk of physical memory to be managed with large page algorithms. Then certain programs use that memory.

A simple yardstick for how much physical memory to reserve for large pages is to set aside 75% for their use. Say for example you have a 64GB system? Then reserve around 45GB for large pages.

If you are using 16MB page sizes, that is around 2800 pages. Instead of over 11 million 4KB pages. This is where the overhead of memory management gets a MAJOR reduction!


Another advantage of using large pages for databases is that by definition, they are not pageable. With Oracle, the SGA can be allocated using largepages and it will never get paged out. That has a distinct advantage in loaded systems: very large numbers of hard page faults in a busy db server are simply a no-no.

OK, so how do we make all this happen?

There are various documents produced by Oracle and IBM on this subject, at various times and various releases of the software. It took a while to distill from them what exactly needs to be done.

As usual, the level of information from Oracle on this subject can only be described as "sparse"...
Let's just say it is resumed to the usual: "lock sga in memory".

Yeah! Right...

Exactly HOW?

Well, setting the corresponding initialization parameter is a good start:

alter system set lock_sga=true scope=spfile;


But there is more. Much more...

Here is what you need to do for the following conditions:

AIX release: 5L, release 5.3
Oracle release: 10.2.0.3

(things might be different for other combinations of major releases, do some research on that. Metalink is a good starting point)

First, check out how the virtual memory is configured in your system:

$sudo vmo -a
among the resulting text, you will see something like this:
lgpg_regions = 0
lgpg_size = 0
lru_file_repage = 1
v_pinshm = 0


OK, we need to change these. Assuming the sizes I mentioned above, you need to tell AIX to set aside a certain amount of physical memory for large pages.

Now, BE CAREFUL!!!!

If you ask AIX to reserve those pages on a system that has a current instance of Oracle running, AIX will happily page out the 4K-page SGA while making up space for the large pages. With dire consequences for anything running in said SGA!

So: do NOT do this on a system where you are currently running a database!

Shut it down FIRST, to free up the memory for large pages!

Got it? Don't say I didn't warn you!

OK, let's then set the memory aside:
$ sudo vmo -p -o v_pinshm = 1 -o lru_file_repage=0
$ sudo vmo -p -o lgpg_size=16777216 -o lgpg_regions=2800

(you don't have to do all the changes in one line, multiple calls to vmo are OK)

Is that all you need to do? Narh, it couldn't be that simple, could it?

Unfortunately, a few other things are necessary...


First of all, we have to give the Oracle dba login the capabilities to lock memory and use large pages. This is how you do it, assuming your login is "oracle":

$sudo chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
and you can check it with:
$sudo lsuser -a capabilities oracle


Next, we enable the Oracle executable to use large pages. If you have re-linked Oracle in a system with largepages enabled, this is already done. If not, then just do it. AFAIK this cannot be checked, simply do it every time to be sure:

$cd $ORACLE_HOME/bin
$sudo ldedit -b lpdata oracle

After this, check that the oracle executable still has the required protection flags set. If not, then set them:

$chmod 6751 oracle


Next, we need to set an environment variable in the .profile of the oracle login:

export LDR_CNTRL=LARGE_PAGE_TEXT=Y@LARGE_PAGE_DATA=M


the above indicates to AIX that processes started by login "oracle" do indeed use largepages if they are available and usable. Do not use quotes around the string, just type it like above in the .profile file.
Yeah, I know it looks weird. Hey: blame IBM!

One final bit of tuning with vmo. If you are using asynchronous I/O - and who isn't? - you may want to make sure AIX reserves very little of its memory for file system cache and allows for very large sizes of pinned memory. This is how you do it:

$sudo vmo -p -o minperm%=5 -o maxperm%=90 -o maxpin%=80


And that's about it! (whew!)

You should see something similar to this on vmstat once you re-start oracle:

$ vmstat -l 30
System configuration: lcpu=8 mem=65536MB ent=2.00
kthr memory page faults cpu large-page
----- ----------- ------------------------ ------------ ----------------------- -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec alp flp
1 1 10873944 232145 0 0 0 0 0 0 610 2889 1997 5 2 84 9 0.14 7.1 2274 226


See those last two columns? In this case , I am using <40GB of large pages, with a little bit in reserve for the odd Oracle process I need to start locally - SQL*Plus, rman, whatever. The "flp" column (free large pages) will hover between 200 and 100 pages free all day.

That's fine, I like a little bit of margin left. Just in case...


Now before you take off doing this on all your AIX systems, some "sanity" disclaimers:

1- This is NOT gospel, I don't do religion. Examine your particular case, see if it makes sense to use large pages, then test all this first!

2- Large pages can be reserved dynamically, but they come out of total physical memory: you have to use the material between your ears to come up with a reasonable and workable number to start with. Don't take all of physical memory with large pages! And do it when not much else is running.

3- This is for AIX5L release 5.3, Oracle 10.2.0.3. Don't come back to me and whinge that it ain't working with Oracle 8.1.7/AIX 4.4 or Oracle11.1.0.7/AIX 6.2! Use your brains and do some research before going off on tangents!

In simple terms: the above is guideline information on how to go about this exercise.
It is NOT an exhaustive description of all nuances and combinations possible. You MUST use your knowledge of your systems and adapt it for your case.



Anyways, enjoy.

I'll get into some photos later, this entry is for my own future reference, I want to keep it technical.

Hopefully, it might save someone else a lot of searching for the solutions or a lot of external consultancy hours.

It's a pity this type of information is not easily available from Oracle and/or IBM, in a condensed form. I guess their professional services have to make a living?...


(Addendum from the comments section)


I made a reference to how Oracle uses AIX's Concurrent I/O (CIO) in one of the replies in the comments.

CIO is an option of JFS, available in the latest releases of AIX. It allows Oracle to bypass AIX buffer cache and use asynchronous I/O as well as avoiding single-threading on the Unix file system lock. Google it and you'll find plenty of details on what it does.

CIO is normally enabled at file system mount time, with the "cio" option.

With JFS2 - available from AIX 5 onwards - Oracle 10g and later releases can actually use a file-open flag that turns on CIO even if the file system was mounted without it. The flag symbolic name is O_CIO.

This is of course highly desirable: instead of having a whole file system in CIO mode, Oracle will only use it for its files, where it is of immediate benefit. Other non-Oracle files in that file system will benefit from buffered AIX I/O.

To signal AIX Oracle 10g onwards to use the O_CIO flag, you have to set the initialization parameter:

filesystemio_options = SETALL

Because this is a new feature, it is possible that other "features" (work with me here!...) may be present in the specific point release or patch level you are using.
10.2.0.3 doesn't have a problem but as Patty C. pointed out in the comments, 10.2.0.4 does - there is now a patch, check the reply.

How do you find out if Oracle is using the flag or not? Easy: trace the dbwr process.

How? Once again, easy:

(Warning: dba1.0 command line stuff coming up.
If you only use dba 2.0 GUI tools to manage your db, I'm afraid you're out of this one: keep clicking - and hoping all is well...)

Start the database in mount mode in a terminal session.

Then get the process id (with ps -ef|grep dbwr) of the dbwr process.

Use the process id to "truss" the dbwr process in a separate terminal session, while in the original you mount the database and then open it. Check the flags in the output of truss for each of the db files.

This is a typical output:

$truss -a -t open -p 12390536
open("/u31/oracle/oradata/dgprod/control01.ctl", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 14
...
open("/u31/oracle/oradata/dgprod/system01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 16
...
open("/u31/oracle/oradata/dgprod/undotbs01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 17
...
etcetc.

As you can see, the O_CIO option is there in the open call and that's what opens files in CIO mode.

If you try this and notice (and become curious as to why) Oracle first opens all files in read-only mode before opening in read-write (O_RDWR) mode, the answer is:
it is how it checks the files are there! Any missing file will cause the open to fail and Oracle will not open the entire database.

Have fun.

13 Comments:

Anonymous Amit said...

Hi,

In your article you have set maperm to 90% and at same time you mentioned that you want to reserve very little memory for file cache. I believe by setting this value, you tell OS to keep 90% of memory for file caching. Shouldn't this value be reduced..say 40 %..

Cheers
Amit

Monday, June 01, 2009 4:57:00 pm  
Blogger Noons said...

Thanks Amit. No, not really.

Have a look at http://www.ibm.com/developerworks/aix/library/au-aixoptimization-memtun1/index.html?ca=drs- and read the section dealing with that parameter, right before "Listing 2".

The way those parameters worked changes completely with lru_file_repage in 5L. That parameter gives preference to "losing" pages from the file cache, so you need to start from something to lose those pages. That's why they recommend maxperm high and minperm low.
Check out the article on that link.
There is also a paper on specifics of AIX 5L for Oracle 10g that points out exactly that difference.
I'll see if I can find the link to the original and will post it.

Still: nothing to do with largepages specifically. I make these changes to ensure that 4K-page memory, left over after the big chunk of large pages is taken over, is not overcommitted to file cache.

Tuesday, June 02, 2009 12:12:00 am  
Anonymous Amit said...

Noons,

I have seen the IBM link given by you. But I have personally seen environments using CIO enabled FS or ASM database's where having high maxperm (lru_file_repage set to 0) is still a overhead as Oracle Processes has to steal memory for filesystem cache (some applications might be using FS cache). Definitely lru_file_repage is a very important parameter and should be set to 0 for servers running oracle database.

And Yeah this has nothing to do with Large pages :)

Cheers
Amit

Tuesday, June 02, 2009 2:34:00 am  
Blogger Noons said...

Let me dispel something upfront: there is no such thing as a need for "cio-enabled" file system with jfs2 and Oracle 10gr2 onwards.

Oracle 10gr2 will use CIO even if we don't mount the file system as such, if filesystemio_options=setall.

Provided we are running AIX 5.3 with jfs2.

You likely already know that, I'm just noting it in case anyone reading thinks they need to mount their FS with CIO for it to be used.


Now, back to memory.

Oracle processes using large pages do NOT need to steal memory from file system cache: they already got it!

Largepage memory space is pre-reserved, non-pageable. All virtual memory page management is active only for the REMAINING 4K pages.

No more file system cache stealing whatsoever, as far as Oracle processes are concerned!

Try getting an SGA and Oracle code to largepages and then force a steal while monitoring with vmstat. It's then very clear what happens.


Coming back to fs cache, you well point out that some applications might be using FS cache. That is the case with mixed systems, where you run an oracle database PLUS some other software in the same system.


Now, the interesting bit as noted above is that when you are using largepages, you can completely ignore the need to tune FS cache/Oracle cache competition contention!


Largepages are allocated once and stay in physical memory all the time: all Oracle-specific IO goes to them - assuming of course one is using "setall" and as a corollary, CIO.

FS cache and virtual memory management now have the remaining 4K-paged physical memory to play with, but will never steal pages from Oracle's largepage pool.


Now, THAT is the beauty of this! You can literally "partition" your physical memory with Oracle SGA managing one big chunk of it without any other overheads, while the rest works as expected and carries on with the usual beg/borrow of virtual memory.


Coming back to your very important point, I'll try to explain why I set maxperm high.

Maxperm basically does this: "If FS cache pages are larger than maxperm, steal only FS cache pages. Otherwise IF you steal more FS cache than computational pages, start stealing computational pages as well".

Now, that is EXACTLY what I want it to do if my SGA is locked away in largepage memory and out of reach of FS cache and I have other processes using remaining 4k-paged memory!

I want file cache to take over all normal 4K-paged memory - if needed - and surrender it to processes using 4K pages as needed, while not overexercising the LRUD page replacement process.

For that to happen I need to set maxperm high, as in any conventional non-oracle AIX system.



If I had maxperm set low, I'd be forcing the page replacement mechanism to reduce FS cache, for no purpose whatsoever: I already have all the memory I need for Oracle locked away in largepages!


For all intents and purposes, I end up with a "bipolar" memory: one big chunk that is fixed and used by Oracle SGA and Oracle code pages, another chunk that behaves as a normal AIX virtual memory system, with minimal overhead.


The best of both worlds? Dunno, but it seems to work rather well: I haven't had a single pageout of SGA since using this model and the overall throughput of the db servers has increased dramatically for all processes. The paging partition hardly ever gets used.



Thanks heaps for your very pertinent comments, hopefully I've been able to explain the reason for the choices we've taken.

Tuesday, June 02, 2009 9:01:00 am  
Anonymous Amit said...

Thanks Noons for such a detailed explanation and clarifying my misconception with regards to 10gr2 and CIO.

Tuesday, June 02, 2009 3:04:00 pm  
Anonymous Patty C. said...

Thanks for this post, it is very timely for us as we are just about to begin testing AIX large pages. One note.. there is a bug in 10.2.0.4 that keeps it from using large pages in AIX. You have to apply patch 7226548 after upgrading from 10.2.0.3 to 10.2.0.4 to get the instance to use large pages.

Saturday, June 06, 2009 4:49:00 am  
Blogger Noons said...

Wow! Thanks for that info, Patty! We'll start looking at 10.2.0.4 soon and that is very handy to know.

Sunday, June 07, 2009 10:18:00 pm  
Anonymous fab said...

Regarding CIO mode that should be used by oracle when filesystemio_options=setall, have you verified that oracle actually open the datafiles in CIO mode ?
I have a 10.2.0.4 instance with filesystemio_options=setall, the datafiles are on a jfs2 filesystem, but when I check with lsof (lsof +fG), I can see that the datafiles have not been opened with the FCIO flag :
COMMAND PID USER FD TYPE FILE-FLAG DEVICE SIZE/OFF NODE NAME
oracle 290868 oracle 17u VREG 0x4400003;0x1 57,1 21474844672 72 /data01 (/dev/lvdata01)
0x4400003=FREAD+FWRITE+FDATASYNC+FLARGEFILE

If I force the filesystem to be mounted in CIO mode, I can see flags=0x4400083 and 0X80=FCIO (found in /usr/include/fcntl.h)

Thursday, September 10, 2009 3:04:00 am  
Blogger Noons said...

That looks awfully like either a problem with 10.2.0.4 - wouldn't surprise me one bit... - or a problem with lsof itself.

lsof doesn't work well with file systems in our SAN, for some reason.
I just truss the dbwr process between startup nomount and alter database mount/open and this is the result, on 10.2.0.3 and one of my test dbs:

$truss -a -t open -p 12390536
open("/u31/oracle/oradata/dgprod/control01.ctl", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 14
.
.
.
open("/u31/oracle/oradata/dgprod/system01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 16
.
.
.
open("/u31/oracle/oradata/dgprod/undotbs01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 17
.
.
.
etcetc.

As you can see, the O_CIO option is there in the open call and that's what opens files in CIO.

Would you be able to check your environment with truss as well?

I just start the db in nomount, look up the PID of dbwr with ps, use that for "-p" in truss, then go back and alter database mount, alter database open.

Another interesting thing is that all db files are first opened in read-only mode, then closed and re-opened in read-write. Dunno why, but it seems to be the norm.

Thursday, September 10, 2009 1:20:00 pm  
Anonymous Fab said...

ok thanks for your hint, I will use truss at next restart of the database and post the result here. Unfortunately I do not have a db available now for testings.
Have you tried to check with lsof the flags of files opened ?

Friday, September 11, 2009 5:02:00 am  
Blogger Noons said...

Yes, I have. It doesn't work at all in our system: only lists the major devices, doesn't understand anything below that. Probably due to us using EMC and Powerpath. Didn't bother figuring out why because truss gave me the data I needed.

Friday, September 11, 2009 9:15:00 am  
Blogger M.Toscano said...

hi ,

i believe the tuning written is so good ,

but i'm trying to understand if is possible to do just the modify for lgpg_regions and lgpg_size ,

or if it is necessary after these values to do the modify even for the other kernel values ... minperm maxperm etc.etc ...
and so on ...


and in particular i wouls like to know if is necessary to do the actions for the env of oracle , for the capabilities etc etc , or not , usually i don't look that -

must i do before the tuning of env of oracle ,and after i can do the new parameters of kernel ?

probably yes -

first is necessary of course to shutdown oracle services , or the machine goes in crash probably ,


thnks , kind regards


max

Tuesday, July 20, 2010 1:50:00 am  
Blogger Noons said...

Thanks, Max.
Yes, Oracle must be shutdown for all this.

The changes to minperm, maxperrm are not strictly necessary, just recommended by IBM. The one for v_pinshm is necessary: it enables shared memory to be pinned so it doesn't get paged out.

The enabling of the capabilities for Oracle is needed, otherwise you won't be able to use largepages even if they are configured.

But the environment variables don't need to be there, only if you want sqlplus and all other command line tools to also use largepages.
There is a note in the IBM doco that this is not necessary: you can do quite well with just the SGA in large pages.

But I like to have everything in largepages including Oracle codepages. Hey: it works well, it is VERY fast, I've got the memory, so why not?

Note that this is only recommended after 10g onwards: the code in previous versions is not aligned at largepage boundaries so there is some memory wastage when using 16M largepages.

As to the sequence:
-first, you shutdown Oracle.
-then, make changes with vmo.
-then. change the capabilities for Oracle login.
-then, change the environment variables if that is the case.
-then you restart Oracle and verify if largepages being used with vmstat.

In latest releases of AIX 5.3 and 6 and with Oracle 10.2.0.4 onwards, you can also use 64K pages instead of 16M. 64K pages are pageabls, while 16M are not. Oracle code at that level is compatible with both so it's really a personal choice which is best.

I like the idea of not ever paging out the SGA so I stick with 16M.

HTH

Tuesday, July 20, 2010 6:51:00 am  

Post a Comment

<< Home