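The previous post, Oracle 10g DataGuard Study (1), gave a brief overview of the general Data Guard (DG) architecture and its related concepts. This post goes one level deeper and covers DG's redo transport services and the redo apply mechanism. The material is not easy to explain, but I will do my best and lean on the official Oracle documentation throughout.
2-1 Redo Transport Services (RTS)
Section 1-2 of the previous post touched on this only briefly, so the question here is: how does the redo generated on the primary database reach the standby database (physical or logical)? DG's Redo Transport Services (RTS) coordinate the transfer of redo data from the primary to the standby while, at the same time, the primary's LGWR process writes the redo to its own online redo log. Stated that baldly this probably means little, so let us take it step by step. There are two ways to ship redo from the primary to the standby. In the first, the primary's ARCn processes send locally archived redo log files to the standby, where the redo lands in the standby redo log files or in the standby's archive destination. In the second, the primary's LGWR process ships redo from the redo log buffer or the online redo log to the standby. Once the standby receives the redo it applies it, and this is how the two databases are kept in sync. The official Oracle documentation describes RTS as follows: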
Redo transport services control the automated transfer of redo data from a database destination to one or more destinations. Redo transport services also manage the process of resolving any gaps in the archived redo log files due to a network failure.
Redo transport services can transmit redo data to local and remote destinations. Remote destinations can include any of the following types: physical and logical standby databases, archived redo log repositories, Oracle Change Data Capture staging databases, and Oracle Streams downstream capture databases
Figure 5-1 shows a simple Data Guard configuration with redo transport services archiving redo data to a local destination on the primary database while also transmitting it to archived redo log files or standby redo log files on a remote standby database destination.
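The figure above gives a simple picture of RTS. The sections below analyze how it works in detail.
2-2 How redo transport works
Following the official Oracle DG documentation, this section would be titled Where to Send Redo Data. It covers three things: ① where the primary's redo should be sent so that the standby can apply it, which is mainly about the types of redo transport destination (Destination Types); ② how to send redo to the specified destination; ③ setting up the flash recovery area. Item ① discusses the destinations RTS supports in DG and other database architectures, and standby databases are naturally among them. Item ③ covers using the flash recovery area in a DG environment. Item ② is the part I want to focus on: how redo generated on the primary gets to the standby so that it can be applied. That is why I renamed this section to How redo transport works. The official documentation summarizes redo transport as follows: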
On the primary database, Oracle Data Guard uses archiver processes (ARCn) or the log writer process (LGWR) to collect transaction redo data and transmit it to standby destinations. Although you cannot use both the archiver and log writer processes to send redo data to the same destination, you can choose to use the log writer process for some destinations, while archiver processes send redo data to other destinations. Data Guard also uses the fetch archive log (FAL) client and server to send archived redo log files to standby destinations following a network outage, for automatic gap resolution, and resynchronization.
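From the description above it is clear that DG has two ways of sending the primary's redo data to the standby. In the first, the primary's ARCn processes send archived logs to the standby, where the redo is written to the standby redo log files. In the second, the primary's LGWR process sends redo from the log buffer or the online redo log files to the standby redo log files. In addition, when something such as a network problem prevents the standby from receiving archived logs and an archive gap results, DG detects the gap automatically and uses the fetch archive log (FAL) client and server to transmit the missing log files from the primary to the standby for recovery. DG does all of this itself, with no manual intervention by the DBA. The official documentation explains archive gaps as follows: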
An archive gap can occur on the standby system when it is has not received one or more archived redo log files generated by the primary database. The missing archived redo log files are the gap. If there is a gap, it is automatically detected and resolved by Data Guard by copying the missing sequence of log files to the standby destination. For example, an archive gap can occur when the network becomes unavailable and automatic archiving from the primary database to the standby database temporarily stops. When the network is available again, automatic transmission of the redo data from the primary database to the failed standby database resumes. Data Guard requires no manual intervention by the DBA to detect and resolve such gaps. The following sections describe gap detection and resolution.
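One way to check whether the standby currently has a gap is to query the V$ARCHIVE_GAP view. The following is a minimal sketch, assuming a SQL*Plus session on a physical standby connected as a privileged user. An empty result means no gap is currently detected.

```sql
-- Run on the physical standby database.
-- Each row describes a range of archived log sequences the standby is missing.
SELECT thread#,
       low_sequence#,
       high_sequence#
FROM   v$archive_gap;
```

The FAL client and server mentioned above are configured through the FAL_CLIENT and FAL_SERVER initialization parameters on the standby.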
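2-2-1 Redo transport with ARCn
Having introduced the two ways of transporting redo data, this section and the next explain how each works in detail. The figure below shows how DG uses ARCn to transport redo data to the standby: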
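As the figure shows, the primary's LGWR process writes redo from the log buffer to the online redo log (ORL). ARC0 then archives the ORL to local disk in sequence, and ARC1 sends the local archived log over the network (Oracle Net) to the RFS (Remote File Server) process on the remote standby. RFS writes the redo entries it receives, in order, to the standby redo log files. The standby's ARCn processes archive the standby redo log to local disk, and the redo is applied to the standby by MRP (Managed Recovery Process, used by Redo Apply) or LSP (Logical Standby Process, used by SQL Apply).
The figure also shows the standby's MRP or LSP process reading redo from the standby's archived logs. Why? There are two reasons. First, as the official text quoted in 2-2-2 below states, apply services by default wait until a standby redo log file has been archived and then apply the redo from the archived log. Only when real-time apply is enabled do they read directly from the current standby redo log file as RFS fills it. Second, as noted earlier in 2-2, when something such as a network problem prevents the standby from receiving archived logs and an archive gap results, DG detects the gap automatically and transfers the missing log files from the primary to the standby for recovery. Spelled out, this is the process circled in red in the figure (which I added based on the official documentation and my own understanding): when DG detects an archive gap on the standby, the primary sends the missing archived redo from its archived logs to the standby's RFS process (ARC3 in the figure), RFS writes the received redo to archived log files on local disk, and the standby's MRP or LSP process then reads and applies it. This keeps the primary and the standby consistent.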
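The processes described above can be observed on a running physical standby through the V$MANAGED_STANDBY view. This is a minimal sketch, assuming a privileged session on the standby. The rows you see depend on your configuration and on whether Redo Apply has been started.

```sql
-- Run on the physical standby database.
-- RFS rows are the receiving processes, MRP0 is the Redo Apply process,
-- and ARCH rows are the standby's archiver processes.
SELECT process,
       status,
       client_process,
       thread#,
       sequence#,
       block#
FROM   v$managed_standby
ORDER  BY process;
```

The CLIENT_PROCESS column indicates which kind of primary-side process (for example an archiver or LGWR) each RFS process is serving.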
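A few notes on redo transport with ARCn:
①Initialization Parameters That Control ARCn Archival Behavior
The explanation above described how ARCn transports redo data. How ARCn actually carries this out is governed by initialization parameters: LOG_ARCHIVE_DEST_n and LOG_ARCHIVE_MAX_PROCESSES determine how ARCn-based redo transport behaves in a DG environment.
LOG_ARCHIVE_DEST_n: must be set on both the primary and the standby. Up to 10 destinations can be defined, and each must specify either the LOCATION or the SERVICE attribute to identify where the redo data goes and how it is sent. For usage details see Chapter 14, LOG_ARCHIVE_DEST_n Parameter Attributes, in Oracle's Data Guard Concepts and Administration. I will not go into them here.
LOG_ARCHIVE_MAX_PROCESSES: specifies the maximum number of ARCn processes and defaults to 4. If the database is under heavy load, raising this value can increase archiving throughput. On a DG primary that uses ARCn to transport redo data to the standby, at least 2 ARCn processes are needed.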
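As an illustration of the two parameters, the following sketch configures one local and one remote ARCn destination on the primary. The path /arch1/chicago and the service name boston are borrowed from the official examples quoted later in this post and stand in for your own values. SCOPE=BOTH assumes the instance was started with an spfile.

```sql
-- Run on the primary database. Assumes an spfile is in use (SCOPE=BOTH).
-- '/arch1/chicago' and 'boston' are placeholder values, not real settings.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_1='LOCATION=/arch1/chicago' SCOPE=BOTH;

-- The ARCH attribute asks for archiver-based transport to the standby.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=boston ARCH' SCOPE=BOTH;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE SCOPE=BOTH;

-- Allow up to 4 archiver processes (at least 2 are needed, as noted above).
ALTER SYSTEM SET LOG_ARCHIVE_MAX_PROCESSES=4 SCOPE=BOTH;
```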
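②ARCn redo transport supports the maximum performance data protection mode
Since the maximum performance mode has come up, the three protection modes a DG database can run in are worth listing: Maximum Protection Mode, Maximum Availability Mode, and Maximum Performance Mode. I will not go into what each one means here (they are covered at the relevant point later). ARCn transport supports the maximum performance mode. In this mode the primary performs better than in the other two, because the primary and the standby are not required to stay synchronized in real time, so the primary's archived logs can be shipped by ARCn and applied on the standby. The price is that zero data loss cannot be guaranteed when a failure occurs. The maximum protection and maximum availability modes demand much tighter synchronization between primary and standby, so they use LGWR to ship redo from the primary's ORL or log buffer to the standby.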
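To see which protection mode a database is currently running in, you can query V$DATABASE. A minimal sketch, assuming a privileged session on either the primary or the standby:

```sql
-- PROTECTION_MODE is the configured mode; PROTECTION_LEVEL is the level
-- currently in effect, which can differ from the configured mode.
SELECT database_role,
       protection_mode,
       protection_level
FROM   v$database;
```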
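2-2-2 Redo transport with LGWR
Having explained how ARCn transports redo data, we now turn to LGWR. LGWR transport differs from ARCn in that it does not wait for the primary to archive the ORL to local disk and then ship each archived redo log file to the standby all at once. Instead, as redo is generated in the primary's log buffer and written to the current online redo log, it is handed to one or more LNSn (Log Network Server) processes, which send it over Oracle Net to the RFS process on the standby. RFS writes the redo to a standby redo log file, after which the standby can apply it.
LGWR can transport redo data in two ways: LGWR SYNC and LGWR ASYNC. Both use LGWR, but their transport mechanisms differ, and so do the data protection modes they support: SYNC supports the maximum protection and maximum availability modes, whereas ASYNC does not. Here is what the official documentation says about LGWR redo transport: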
You can optionally enable redo transport services to use the LGWR process to transmit redo data to remote destinations.
Using the LGWR process differs from ARCn processing, because instead of waiting for the online redo log to switch at the primary database and then writing the entire archived redo log at the remote destination all at once, the LGWR process selects a standby redo log file at the standby site that reflects the log sequence number (and size) of the current online redo log file of the primary database. Then, as redo is generated at the primary database, it is also transmitted to the remote destination. The transmission to the remote destination will either be synchronous or asynchronous, based on whether the SYNC or the ASYNC attribute is set on the LOG_ARCHIVE_DEST_n parameter. Synchronous LGWR processing is required for the maximum protection and maximum availability modes of data protection in Data Guard configurations.
To make this clearer, we use the diagrams from the official documentation to walk through LGWR SYNC and LGWR ASYNC in turn.
①LGWR SYNC
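The official description of LGWR redo transport says that whether SYNC or ASYNC is used is determined by the LOG_ARCHIVE_DEST_n parameter. I will not go over the parameter's usage here; see the official documentation. The official description of LGWR SYNC follows: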
Example Initialization Parameters for LGWR Synchronous Archival
LOG_ARCHIVE_DEST_1='LOCATION=/arch1/chicago'
LOG_ARCHIVE_DEST_2='SERVICE=boston LGWR SYNC NET_TIMEOUT=30'
LOG_ARCHIVE_DEST_STATE_1=ENABLE
LOG_ARCHIVE_DEST_STATE_2=ENABLE
Specifying the SYNC attribute on the LOG_ARCHIVE_DEST_n parameter is optional, because this is the default for LGWR archival processing. The NET_TIMEOUT attribute is recommended, because it controls the amount of time that the LGWR process waits for status from the network server process before terminating the network connection. If there is no reply within NET_TIMEOUT seconds, then the LGWR process returns an error message.
Figure 5-4 shows a Data Guard configuration that uses the LGWR process to synchronously transmit redo data to the standby system at the same time it is writing redo data to the online redo log file on the primary database:
On the primary database, the LGWR process submits the redo data to one or more network server (LNSn) processes, which then initiate the network I/O in parallel to multiple remote destinations. Transactions are not committed on the primary database until the redo data necessary to recover the transaction is received by all LGWR SYNC destinations.
On the standby system, the remote file server (RFS) receives redo data over the network from the LGWR process and writes the redo data to the standby redo log files.
A log switch on the primary database triggers a log switch on the standby database, causing ARCn processes on the standby database to archive the standby redo log files to archived redo log files on the standby database. Then, Redo Apply (MRP process) or SQL Apply (LSP process) applies the redo data to the standby database. If real-time apply is enabled, Data Guard recovers redo data directly from the current standby redo log file as it is being filled up by the RFS process.
From the figure and the official explanation above, one point is enough to understand LGWR SYNC: "Transactions are not committed on the primary database until the redo data necessary to recover the transaction is received by all LGWR SYNC destinations", and on the standby "the remote file server (RFS) receives redo data over the network from the LGWR process and writes the redo data to the standby redo log files". This says it plainly: a transaction on the primary can commit only after LGWR has handed its redo to LNSn and LNSn has delivered it over Oracle Net to the standby's RFS process, which writes it to the standby redo log files. After that, the standby's ARCn processes archive the standby redo log to local disk, and MRP or LSP applies the redo to the standby.
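To confirm how a remote destination is actually configured, you can look at V$ARCHIVE_DEST on the primary. A minimal sketch, assuming a privileged session on the primary. For the SYNC example above you would expect the ARCHIVER column to show LGWR and NET_TIMEOUT to show the configured value.

```sql
-- Run on the primary database.
-- TARGET = 'STANDBY' restricts the output to remote destinations.
SELECT dest_id,
       status,
       archiver,
       transmit_mode,
       affirm,
       net_timeout,
       destination
FROM   v$archive_dest
WHERE  target = 'STANDBY';
```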
②LGWR ASYNC
Having covered LGWR SYNC at some length, we now turn to LGWR ASYNC. Let us first look at the original text of the official documentation:
Figure 5-5 shows the LNSn process collecting redo data from the online redo log files and transmitting it over Oracle Net to the RFS process on the standby database.
Example 5-6 Initialization Parameters for LGWR Asynchronous Archiving
LOG_ARCHIVE_DEST_1='LOCATION=/arch1/chicago'
LOG_ARCHIVE_DEST_2='SERVICE=boston LGWR ASYNC'
LOG_ARCHIVE_DEST_STATE_1=ENABLE
LOG_ARCHIVE_DEST_STATE_2=ENABLE
When the LGWR and ASYNC attributes are specified, the log writer process writes to the local online redo log file, while the network server (LNSn) processes (one for each destination) asynchronously transmit the redo to remote destinations. The LGWR process continues processing the next request without waiting for the LNS network I/O to complete.
If redo transport services transmit redo data to multiple remote destinations, the LNSn processes (one for each destination) initiate the network I/O to all of the destinations in parallel.
When an online redo log file fills up, a log switch occurs and an archiver process archives the log file locally, as usual.
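From the figure and the official description above, one point is again enough to understand LGWR ASYNC: "When the LGWR and ASYNC attributes are specified, the log writer process writes to the local online redo log file, while the network server (LNSn) processes (one for each destination) asynchronously transmit the redo to remote destinations. The LGWR process continues processing the next request without waiting for the LNS network I/O to complete." In other words, with LGWR ASYNC the LNSn process reads redo data from the primary's ORL and sends it over Oracle Net to the standby's RFS process, which writes it to the standby redo log files. The standby then applies the redo, which keeps primary and standby in sync. In this mode a transaction on the primary does not have to wait for the standby's RFS process to receive the redo and write it to the standby redo log files before it can commit. To that extent a DG database running with LGWR ASYNC performs better than one running with LGWR SYNC, but by the same token zero data loss cannot be guaranteed if the primary fails.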
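Because ASYNC lets the standby fall behind, it can be useful to check roughly how far each destination has got. The following is a coarse sketch, at log sequence granularity only, assuming a privileged session on the primary. It compares the primary's current log sequence with the last sequence archived and applied at each remote destination.

```sql
-- Run on the primary database.
-- Current log sequence per redo thread on the primary.
SELECT thread#,
       sequence#
FROM   v$log
WHERE  status = 'CURRENT';

-- Last sequence received (archived) and applied at each remote destination.
SELECT dest_id,
       status,
       archived_thread#,
       archived_seq#,
       applied_thread#,
       applied_seq#
FROM   v$archive_dest_status
WHERE  status <> 'INACTIVE'
AND    type   <> 'LOCAL';
```

A large difference between the two results suggests the standby is lagging. It does not measure how much redo inside the current log has not yet been shipped.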
A few notes on redo transport with LGWR:
①The transmission to the remote destination will either be synchronous or asynchronous, based on whether the SYNC or the ASYNC attribute is set on the LOG_ARCHIVE_DEST_n parameter.
②The SYNC attribute performs all network I/O synchronously, in conjunction with each write operation to the online redo log file, and waits for the network I/O to complete.
③The ASYNC attribute performs all network I/O asynchronously and control is returned to the executing application or user immediately, without waiting for the network I/O to complete.
④If you configure a destination to use the LGWR process, but for some reason the LGWR process becomes unable to archive to the destination, then redo transport will revert to using the ARCn process to complete archival operations.
⑤If redo transport services transmit redo data to multiple remote destinations, the LNSn processes (one for each destination) initiate the network I/O to all of the destinations in parallel.
⑥If real-time apply is enabled, Data Guard recovers redo data directly from the current standby redo log file as it is being filled up by the RFS process.
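For note ⑥, real-time apply on a physical standby is requested when Redo Apply is started. A minimal sketch, assuming standby redo log files already exist on the standby and that the standby is destination 2 on the primary, as in the examples above:

```sql
-- Run on the physical standby database.
-- USING CURRENT LOGFILE requests real-time apply from the standby redo log.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;

-- Run on the primary database to check the apply mode of the destination.
-- dest_id = 2 is an assumption matching LOG_ARCHIVE_DEST_2 above.
SELECT dest_id,
       recovery_mode
FROM   v$archive_dest_status
WHERE  dest_id = 2;
```

On a logical standby the corresponding statement is ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE.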
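This post explained how redo data gets from the primary to the standby in a DG environment and, with the help of the official documentation, some related concepts. Later posts will cover how redo data is applied in DG and how to build a DG configuration, including creating physical and logical standby databases, and finally DG administration, maintenance, and basic tuning.
PS: To be continued…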