PowerShell - How to speed up a script and not clog RAM (directory size 2+ million files)

July 15, 2010

I have written some novice code which:

runs through a master directory (2+ million files, many sub-folders), filters for .tmx files (around 12k files in the master directory), extracts specific strings, and saves them to a .log file. After that is done, there are sub-procedures that clean the log files and merge them into one file. The problem is that I left the script running overnight and it got stuck.

I reckon it's due to the directory size. I have listed the directory's files in a .txt file. Would it be possible for the script to read that .txt and process one file at a time? Maybe that way it won't take up 99% of RAM after a while.

Also, maybe you have other insights on speeding it up or merging these procedures into one.

Get-ChildItem "masterdirpath\*.tmx" -Recurse | ForEach-Object {
    $content = Get-Content $_.FullName

    # filter and save content to the original file
    #$content | Where-Object {$_ -match '<tu '} | Set-Content $_.FullName

    # filter and save content to a new file
    $content |
        Where-Object {!($_ -match '(?:creationid|changeid)="([^"]+)"' -or
                        $_ -match '(<tuv.+?lang="[a-zA-Z\-]+">)')} |
        %{$Matches[1]} |
        Get-Unique |
        Set-Content ($_.BaseName + '_out.log')
}

Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
    $content = Get-Content -Raw $_.FullName
    # put the extracted matches on one line
    $content -replace "`r`n<", "`t<" | Set-Content $_.FullName
}

Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
    $content = Get-Content $_.FullName
    # filter and save content to the original log file
    $content | Where-Object {$_ -match '^.+$'} | sort | Get-Unique | Set-Content $_.FullName
}

$path = "dir\tologs"
$out  = "dir\tologs\output.txt"

Get-ChildItem $path -Filter *.log | % {
    $file = $_.Name
    Get-Content $_.FullName | % {
        "${file}: $_" | Out-File -Append $out
    }
}

Update: sample input (the parts I want to extract are the creationid and changeid values of each <tu> and the <tuv xml:lang="..."> tags)

These .tmx files vary in size from 1 MB to 2 GB. The directory size is around 1 TB, and the files there can be anywhere from a few MB up to a few GB. The script runs fine on a small directory with 50 .tmx files of 1-100 MB.

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header creationtool="memoq" creationtoolversion="7.0.68" segtype="sentence"
          adminlang="en-us" creationid="lsmall" srclang="en-us" o-tmf="memoqtm"
          datatype="unknown">
    <prop type="defclient"> </prop>
    <prop type="defproject"> </prop>
    <prop type="defdomain"> </prop>
    <prop type="defsubject"> </prop>
    <prop type="description"> </prop>
    <prop type="targetlang">hu</prop>
    <prop type="name">28807_project_hu</prop>
  </header>
  <body>
    <tu changedate="20151104t174128z" creationdate="20150929t180844z" creationid="pmccrory" changeid="lsmall">
      <prop type="client"> </prop>
      <prop type="project"> </prop>
      <prop type="domain"> </prop>
      <prop type="subject"> </prop>
      <prop type="corrected">no</prop>
      <prop type="aligned">no</prop>
      <prop type="x-document">node_data_en_final.xml</prop>
      <tuv xml:lang="en-us">
        <prop type="x-context-pre">&lt;seg&gt;biomarkers and integrated solutions&lt;/seg&gt;</prop>
        <prop type="x-context-post">&lt;seg&gt;novel therapeutic agents have fast onset of action, safety and tolerability profiles and address common co-morbidities (for example, anxiety and substance abuse) &lt;ph type='fmt'&gt;{}&lt;/ph&gt;&lt;it pos='begin'&gt;&amp;lt;ul&amp;gt;&lt;/it&gt;&lt;/seg&gt;</prop>
        <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
      </tuv>
      <tuv xml:lang="hu">
        <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
      </tuv>
    </tu>
  </body>
</tmx>

Output after the procedure:

aabb-cor-09_master_de_out.log: 6293 syb	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">
abb-cor-09_master_de_out.log: ad	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">
abb-cor-09_master_de_out.log: agentile	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">
abb-cor-09_master_de_out.log: align!	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">
abb-cor-09_master_de_out.log: angelika	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">
abb-cor-09_master_de_out.log: asedr	<tuv xml:lang="en-us">	<tuv xml:lang="de-de">

Use one pipeline for the entire processing instead of 4 separate passes.
Use the string operators -join and -split instead of writing to and reading from the same file.
Use the [regex] class and its Matches method to extract the tokens you want.

$rx_extract = [regex](
    '(?<=(creationid|changeid)=")[^"]+(?=")|' +
    '<tuv.+?lang="[a-zA-Z\-]+">'
) # unwanted parts are suppressed from the output via look-behind and look-ahead

Get-ChildItem (Join-Path $tmx_dir *.tmx) -Recurse | ForEach {
    $_.FullName + ': ' + (
        ($rx_extract.Matches((Get-Content $_.FullName -Raw)).Value | Get-Unique
        ) -join "`n" -replace '\n<', "`t<" -split "`n" -ne '' | sort -Unique
    ) -join "`t"
} | Out-File "dir\tologs\output.txt"
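
The operator chain in the pipeline above is dense, so here is a small sketch that runs the same steps one at a time on an in-memory string. The two-line $sample is a hypothetical stand-in for the contents of one .tmx file, and the intermediate variable names are made up for illustration. The pattern is the one from the answer, with the character class written as [a-zA-Z\-] because the [regex] class is case-sensitive by default. Member enumeration (.Value on the match collection) assumes PowerShell 3.0 or later.

```powershell
# Pattern from the answer: look-behind/look-ahead keep only the attribute value.
$rx = [regex](
    '(?<=(creationid|changeid)=")[^"]+(?=")|' +
    '<tuv.+?lang="[a-zA-Z\-]+">'
)

# Hypothetical sample standing in for the raw text of one .tmx file.
$sample = '<tu creationid="pmccrory" changeid="lsmall">' + "`n" +
          '<tuv xml:lang="en-us"><tuv xml:lang="hu">'

# Step 1: collect every match as a plain string.
$tokens = $rx.Matches($sample).Value

# Step 2: join the tokens into one newline-separated string.
$joined = ($tokens | Get-Unique) -join "`n"

# Step 3: glue each <tuv ...> tag onto the preceding line with a tab.
$merged = $joined -replace '\n<', "`t<"

# Step 4: split back into entries, drop empty ones, sort and de-duplicate.
$entries = $merged -split "`n" -ne '' | Sort-Object -Unique

# Step 5: one tab-separated line per file.
$line = $entries -join "`t"
$line
# lsmall<TAB><tuv xml:lang="en-us"><TAB><tuv xml:lang="hu"><TAB>pmccrory
```

Nothing is written to disk between the steps, which is the point of the second bullet: the intermediate results only ever exist as strings in memory.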

Not tested extensively. Use it as an example.
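
One thing to keep in mind: Get-Content -Raw loads a whole file into memory at once, which may be a problem for the multi-gigabyte .tmx files mentioned in the question. Below is a minimal sketch of a line-by-line variant that also uses the pre-built file list from the question. It is an alternative technique, not part of the answer above. The names Get-TmxTokens and Export-TmxTokens are hypothetical, and the file list is assumed to hold one full path per line. It relies on [System.IO.File]::ReadLines, which enumerates lines lazily and needs PowerShell 3.0 or later. Since the pattern never matches across a line break, matching per line should find the same tokens.

```powershell
# Hypothetical helper: returns the unique tokens of one file, read line by line.
function Get-TmxTokens {
    param([string]$Path)   # use a full path; .NET ignores the PowerShell location

    $rx = [regex](
        '(?<=(creationid|changeid)=")[^"]+(?=")|' +
        '<tuv.+?lang="[a-zA-Z\-]+">'
    )
    # The set holds only distinct tokens, never the file contents.
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'

    # ReadLines yields one line at a time instead of loading the whole file.
    foreach ($line in [System.IO.File]::ReadLines($Path)) {
        foreach ($m in $rx.Matches($line)) {
            [void]$seen.Add($m.Value)
        }
    }
    $seen | Sort-Object
}

# Hypothetical driver: reads the file list (one full path per line, an
# assumption) and writes one "path: token<TAB>token..." line per file.
function Export-TmxTokens {
    param([string]$ListPath, [string]$OutPath)

    Get-Content -Path $ListPath |
        Where-Object { $_ } |      # skip empty lines in the list
        ForEach-Object { $_ + ': ' + ((Get-TmxTokens -Path $_) -join "`t") } |
        Set-Content -Path $OutPath
}

# Example call (both paths are placeholders):
# Export-TmxTokens -ListPath 'C:\temp\filelist.txt' -OutPath 'C:\temp\output.txt'
```

Unlike the answer's pipeline, this sketch does not attach the <tuv> tags to the preceding id, so the output layout differs. Treat it as a starting point for keeping memory flat, not as a drop-in replacement.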
