Topic: Update: Sporadic error when opening files in General

wahorger

Posts: 1273 Morrison, CO, USA

Back to Top

23 Jan 2015 5:50 #15415

While I have made progress on a work-around, I have yet to understand the actual cause. It appears to be a combined interaction between the Operating System (Windows 7, 64-bit, Pro, SP1) with a McAfee Virus Scan component added. But, it is not all the Virus Scanner. That is the most unusual part.

The original problem was an unexpected IOSTAT= error code (10005, or 10002) returned from an OPEN. Something was sporadically interfering with the OPEN, and one of the suspects would be McAfee. The files being written were in a folder underneath the Documents folder with a .DAT extension. The cases studied are with McAfee Real Time Scanning (RT) enabled or disabled, and the file being written is either on a MAPPED drive (using SUBST to perform the logical mapping) or on a physical partition. The MAPPED drive is as described, a folder within the Documents folder. The partition drive is the logical F: drive on my system, a partition of the main drive.

If the McAfee Real-Time scanning is enabled, then these errors will occur. The suspicion is that the RT scanning feature gets in the way of the OPEN, possibly intercepting the call prior to the OS getting it. In any case, the occurrence of 10005 errors is about 30% of the time on a MAPPED drive and about 0.5% on an un-Mapped drive. In the cases shown below, there are no CLOSE errors. However, in other testing over a period of a few hours, there is an occasional error in the CLOSE operation, with a 10002 or 10005 error. Occasionally, the error code at CLOSE was seen to be a 32 (from the HELP file: 'Second argument to MOD is zero').

Mapped Drive
           Event Name   IOSTAT  Cur. Time Open  Del  Wrt Clos Count
+    Just checking in        0 090914.852    0    0    0    0    0
+   open (new) failed    10005 090920.862    1    0    0    0    8
+   open (new) failed    10005 090923.738    2    0    0    0    9
+   open (new) failed    10005 090928.402    3    0    0    0   12
+   open (new) failed    10005 090933.127    4    0    0    0   15
+   open (new) failed    10005 090937.801    5    0    0    0   18
+   open (new) failed    10005 090944.177    6    0    0    0   23
+   open (new) failed    10005 090947.035    7    0    0    0   24
+   open (new) failed    10005 090949.870    8    0    0    0   25
+   open (new) failed    10005 090952.762    9    0    0    0   26
+   open (new) failed    10005 090956.436   10    0    0    0   28
+   open (new) failed    10005 091003.523   11    0    0    0   34
+   open (new) failed    10005 091007.143   12    0    0    0   36
+   open (new) failed    10005 091010.751   13    0    0    0   38
+   open (new) failed    10005 091019.889   14    0    0    0   46

14 failures/retries in 46 successful opens



Unmapped Drive
           Event Name   IOSTAT  Cur. Time Open  Del  Wrt Clos Count
+   open (new) failed    10005 094152.523    1    0    0    0 15746 Transition from RT Scan Off to ON
+    Just checking in        0 094202.403    1    0    0    0 15872
+    Just checking in        0 094219.090    1    0    0    0 16128
+    Just checking in        0 094235.840    1    0    0    0 16384
+   open (new) failed    10005 094240.040    2    0    0    0 16440
+   open (new) failed    10005 094243.089    3    0    0    0 16458
+   open (new) failed    10005 094254.081    4    0    0    0 16598
+   open (new) failed    10005 094256.198    5    0    0    0 16600
+   open (new) failed    10005 094258.690    6    0    0    0 16608
+    Just checking in        0 094302.822    6    0    0    0 16640
+   open (new) failed    10005 094304.614    7    0    0    0 16665
+    Just checking in        0 094322.231    7    0    0    0 16896
+    Just checking in        0 094338.897    7    0    0    0 17152
+   open (new) failed    10005 094348.565    8    0    0    0 17299

8 failures/retries in 1553 successful opens

I also have a Windows 2000 machine, no virus software installed. In this environment, it matters not if the drive is mapped or unmapped, there are no errors.

So, there appears to be some interaction on mapped drives. This may also explain the occasional error when rebuilding code encountered under Plato where the error is 'Cannot write error log' (or something close to that; it has been a while) occurs. Build again, and the error seldom re-occurs.

The code used to run these tests is shown below. I would invite others to try this an report their results along with system details. I will run the test program in SAFE mode to see if a 'Clean' system has any of these issues on a logical drive or mapped drive.

!FTN95 application...
PROGRAM main
 integer print_every
 integer failed_open,failed_delete,failed_close,failed_write
 character iorecord(194)
 character*8 ddate
 character*10 ttime
 character*5 zzone
 integer values(8)
! Test program to open a unformatted sequential access file of random data, then write several records
! then close it and try again. Each failure to open will be logged along with the error code and date/time
!
 print_every = 0
 failed_open = 0
 failed_delete = 0
 failed_close = 0
 failed_write = 0
 call date_and_time(ddate,ttime,zzone,values)
 print *,'Starting the test at: ',ddate(1:8),' ',ttime(1:10)
 open(unit=15,file='testfile.dat',access='sequential',iostat=icode,err=12000,&
 form='unformatted',recl=194,status='replace',share='compat')
 do i=1,30
 call randomfile(iorecord,194)
 call date_and_time(ddate,ttime,zzone,values)
 write(15,err=13000)iorecord
 end do
 call date_and_time(ddate,ttime,zzone,values)
 close(unit=15)

1000 continue
num_opened = 0
2000 continue
 call date_and_time(ddate,ttime,zzone,values)
  open(unit=15,file='testfile.dat',access='sequential',iostat=icode,err=15000,&
 form='unformatted',recl=194,status='replace',share='compat')
 num_opened = num_opened+1
 do i=1,30
   call randomfile(iorecord,194)
   call date_and_time(ddate,ttime,zzone,values)
   write(15,err=13000)iorecord
 end do
3000 continue
 call date_and_time(ddate,ttime,zzone,values)
 close(unit=15,iostat=icode,err=20000)
 icode = 0
 if(mod(print_every,256).eq.0) write(*,99000)'Just checking in',icode,ttime(1:10),failed_open,failed_delete,&
   failed_write,failed_close,print_every
 print_every = print_every+1
 go to 2000
 stop
11000 continue
 failed_delete = failed_delete+1
 write(*,99000)'Delete failed',icode,ttime(1:10),failed_open,failed_delete,failed_write,failed_close,print_every
 99000 format('+',A20,1x,I8,1X,A,I5,I5,I5,I5,I5)
 go to 2000
12000 continue
 failed_open = failed_open+1
 write(*,99000)'Initial open failed',icode,ttime(1:10),failed_open,failed_delete,failed_write,failed_close,print_every
 stop
13000 continue
 failed_write = failed_write+1
 write(*,99000)'write failed',icode,ttime(1:10),failed_open,failed_delete,failed_write,failed_close,print_every
 go to 3000
15000 continue
 failed_open = failed_open+1
 write(*,99000)'open (new) failed',icode,ttime(1:10),failed_open,failed_delete,failed_write,failed_close,print_every
 call sleep1@(2.0)
 num_opened = 0
 go to 2000
20000 continue
 failed_close = failed_close+1
 write(*,99000)'file close failed',icode,ttime(1:10),failed_open,failed_delete,failed_write,failed_close,print_every
 call date_and_time(ddate,ttime,zzone,values)
 close(unit=15,iostat=icode,err=20000)
 go to 2000
 
END PROGRAM main
 subroutine randomfile(iorecord,length)
 character*1 iorecord(length)
 do i=1,length
   iorecord=char(int(255.*random@()))
 end do
  return
  end

wahorger

Posts: 1273 Morrison, CO, USA

Back to Top

23 Jan 2015 6:15 #15416

In Safe mode, McAfee loads, but was not enabled for RT scanning. In 1280 attempts for both mapped and un-mapped drives, there were no errors detected.

So, whether it is McAfee or not, SOMETHING is getting in the way. But it would appear that a simple retry scheme will work just fine.

JohnCampbell

Posts: 2526 Sydney

Back to Top

24 Jan 2015 10:22 #15424

These types of errors can be due to the Microsoft OS. At times, the windows file handles are not immediately released and you can get a file in use error. As Paul has suggested, using a sleep@ delay and trying again will often fix the problem.

It would be good to get what 10005 or 10002 mean and where they are coming from.

Your tracking and reporting these errors is a good approach.

John

wahorger

Posts: 1273 Morrison, CO, USA

Back to Top

28 Apr 2015 1:34 #16250

I have had even more occurrences of odd file open/close behavior that I thought I'd share.

A very old section of the code opens a file reads several records from it, then closes the file. Usually, this is when the user interacts with the program. However, it is a 'generic' routine that can be used widely. Its functioning is to prevent a conflict with another user who might need access to this file, so keeping the file opened for a short period of time makes sense. And this is where the problem occurs.

I call the routine multiple times, sequentially, as I step automatically through a set of inputs to the routine. So what I have, in essence, is a rapid sequence of open/read/close/open/read/close/... And, sometimes, I'll get a 'File in Use' error (134). I am the only user on the system, and I'm only running this one process that accesses the file.

My suspicion is that once the request to the system to close the file is made, it is possible (perhaps likely?) that a different processor thread is invoked to accomplish this task. While this is proceeding, my process then requests that this file be re-opened. And, this is where the conflict arises, and I get the 'File in Use' error.

I went back to the common file open/close routine and verified that I was waiting if an error was detected, and to retry the operation. I did notice that I was using SLEEP1@, and changed that to SLEEP@. I also increased the number of retries.

Now, I can see when the program stutters (almost a pause) while it is going into retry mode. But it is ultimately successful each time. I've run a number of cases in an attempt to have it fail and have not had an issue.

Some had suggested that a delay be put in place to prevent this, and that had been in place, but either not enough delay or not enough retry occurrences. Just like most Windows software, there appears to be a non-deterministic interaction between the applications and the underlying OS. Frustrating, but it can be managed.

I also realize that I should probably have two versions of this function. One to use in a 'one-of' situation, and the other should leave the fileopen until it is no longer needed. That is the better technique anyway.

Sigh. One more piece of code to rewrite!!

Hope this helps anyone facing the same situation.

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

Back to Top

29 Apr 2015 1:11 #16255

It will take far longer, and be much more obstructive to other users (if there are any), to follow your strategy - just get on and read the whole file in one go. Windows will cope with the other users' requests.

If the file is in plain text, it is possible for a user to edit it, and then some checking on file validity is required as you read it just in case they banjaxed it. If the file is only created in your program, and there is never any question of user-editing, then you could write it when you last updated it and re-read it when you want it again as unformatted. In such a case, it will read faster anyway (but who cares about milliseconds?) and doesn't need any checking on the fly - if at all. You can read thousands of lines in the blink of an eye

If you absolutely must go through repeated OPEN - READ - CLOSE cycles, you have to remember that flushing the disk cache is asynchronous with your program, and if your file is being copied out of the disk cache onto the hard drive you* will* get those file in use errors. With the greatest of respect, expecting a system like Windows to do everything sequentially and in time with your Fortran code is to believe that there have been no advances in operating systems since around 1957! (Or perhaps even earlier).

If you respond to an error by retrying immediately afterwards you will probably catch the same issue - until eventually it works (as you say it does). Sure, you can go to sleep and try again later, but if your experience is that this produces an appreciable delay in your program's responsiveness, you ought to let the user know via a dialog box.

Meanwhile, you have irritated the little imp in the computer who copies the bytes longhand into a ledger ... Hang on, that's the spirit of Terry Pratchett being channelled in my reply!

Eddie

wahorger

Posts: 1273 Morrison, CO, USA

Back to Top

29 Apr 2015 2:03 #16258

Yes, Eddie, the program does expect things to be done by the time the system returns control to it, and that is indeed the essence of the problem.

While not from 1957, this code in particular dates from the early 80's, when things were more 'predictable'.

Even so, why does the 'C' version code (run from the command line, and as a 32-bit program), which does this process in the same way, not have a problem? Makes me more than wonder.

And, I still get the sporadic errors with 5 digit long error numbers even when opening a file just once.

Still trying to get a real answer to this class of problem that occurs in my development environment as well as the user machine(s).

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

Back to Top

29 Apr 2015 2:43 #16259

You might have more luck with an SSD. You'll definitely have less luck with a NAS or external hard drive.

Most computers are now so fast that we forget the timing errors of early windows versions.

If C (+, - , #, ++ or any other variety) works for you then why don't you link the C routine. Does it work with Salford/Silverfrost C - which is a facility that few people using FTN95 even know is there?

Eddie